
2025-05-16-12-04

A Multimodal Multi-Agent Framework for Radiology Report Generation

Abstract

arXiv:2505.09787v1 Announce Type: new Abstract: Radiology report generation (RRG) aims to automatically produce diagnostic reports from medical images, with the potential to enhance clinical workflows and reduce radiologists' workload. While recent approaches leveraging multimodal large language models (MLLMs) and retrieval-augmented generation (RAG) have achieved strong results, they continue to face challenges such as factual inconsistency, hallucination, and cross-modal misalignment. We propose a multimodal multi-agent framework for RRG that aligns with the stepwise clinical reasoning workflow, where task-specific agents handle retrieval, draft generation, visual analysis, refinement, and synthesis. Experimental results demonstrate that our approach outperforms a strong baseline in both automatic metrics and LLM-based evaluations, producing more accurate, structured, and interpretable reports. This work highlights the potential of clinically aligned multi-agent frameworks to support explainable and trustworthy clinical AI applications.

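Summary

Radiology report generation (RRG) aims to automatically generate diagnostic reports from medical images, with the potential to streamline clinical workflows and reduce radiologists' workload. Although current methods based on multimodal large language models (MLLMs) and retrieval-augmented generation (RAG) have achieved notable results, they still face challenges such as factual inconsistency, hallucination, and cross-modal misalignment. This study proposes a multimodal multi-agent framework that follows the stepwise clinical reasoning workflow, in which task-specific agents separately handle retrieval, draft generation, visual analysis, refinement, and synthesis. Experimental results show that the method outperforms a strong baseline on both automatic metrics and LLM-based evaluation, and that it generates more accurate, structured, and interpretable reports. The work points to the potential of clinically aligned multi-agent frameworks for supporting explainable and trustworthy clinical AI applications.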


Demystifying AI Agents: The Final Generation of Intelligence

Abstract

arXiv:2505.09932v1 Announce Type: new Abstract: The trajectory of artificial intelligence (AI) has been one of relentless acceleration, evolving from rudimentary rule-based systems to sophisticated, autonomous agents capable of complex reasoning and interaction. This whitepaper chronicles this remarkable journey, charting the key technological milestones--advancements in prompting, training methodologies, hardware capabilities, and architectural innovations--that have converged to create the AI agents of today. We argue that these agents, exemplified by systems like OpenAI's ChatGPT with plugins and xAI's Grok, represent a culminating phase in AI development, potentially constituting the "final generation" of intelligence as we currently conceive it. We explore the capabilities and underlying technologies of these agents, grounded in practical examples, while also examining the profound societal implications and the unprecedented pace of progress that suggests intelligence is now doubling approximately every six months. The paper concludes by underscoring the critical need for wisdom and foresight in navigating the opportunities and challenges presented by this powerful new era of intelligence.

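Summary

The trajectory of artificial intelligence (AI) has been one of continuous acceleration, evolving from simple rule-based systems into autonomous agents capable of complex reasoning and interaction. This whitepaper traces that evolution and analyzes the key technological milestones behind today's AI agents, including advances in prompting, training methods, hardware capabilities, and architectural innovation. We argue that agents exemplified by OpenAI's ChatGPT with plugins and xAI's Grok mark what may be a culminating phase of AI development, and may constitute the "final generation" of intelligence as currently conceived. Through concrete examples, we examine the core capabilities and technical foundations of these agents and analyze their far-reaching societal impact. The paper notes that intelligence is advancing at a pace of roughly doubling every six months, and that this unprecedented speed calls for wisdom and foresight in handling the opportunities and challenges of this powerful new era of intelligence.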


Unlocking Location Intelligence: A Survey from Deep Learning to The LLM Era

Abstract

arXiv:2505.09651v1 Announce Type: new Abstract: Location Intelligence (LI), the science of transforming location-centric geospatial data into actionable knowledge, has become a cornerstone of modern spatial decision-making. The rapid evolution of Geospatial Representation Learning is fundamentally reshaping LI development through two successive technological revolutions: the deep learning breakthrough and the emerging large language model (LLM) paradigm. While deep neural networks (DNNs) have demonstrated remarkable success in automated feature extraction from structured geospatial data (e.g., satellite imagery, GPS trajectories), the recent integration of LLMs introduces transformative capabilities for cross-modal geospatial reasoning and unstructured geo-textual data processing. This survey presents a comprehensive review of geospatial representation learning across both technological eras, organizing them into a structured taxonomy based on the complete pipeline comprising: (1) data perspective, (2) methodological perspective and (3) application perspective. We also highlight current advancements, discuss existing limitations, and propose potential future research directions in the LLM era. This work offers a thorough exploration of the field and provides a roadmap for further innovation in LI. A summary of the up-to-date paper list can be found at https://github.com/CityMind-Lab/Awesome-Location-Intelligence and will be updated continuously.

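Summary

Location Intelligence (LI), the science of turning location-based geospatial data into actionable knowledge, has become a cornerstone of modern spatial decision-making. The rapid development of geospatial representation learning is fundamentally reshaping LI through two successive technological revolutions: the deep learning breakthrough and the emerging large language model (LLM) paradigm. While deep neural networks (DNNs) have proven highly effective at automatically extracting features from structured geospatial data (such as satellite imagery and GPS trajectories), the recent integration of LLMs brings transformative capabilities for cross-modal geospatial reasoning and for processing unstructured geo-textual data. This survey reviews geospatial representation learning across both technological eras and builds a structured taxonomy around the complete pipeline: (1) the data perspective, (2) the methodological perspective, and (3) the application perspective. We also highlight current progress, discuss existing limitations, and propose potential future research directions for the LLM era. The work offers an in-depth exploration of the field and a roadmap for further innovation in LI.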


AI Greenferencing: Routing AI Inferencing to Green Modular Data Centers with Heron

Abstract

arXiv:2505.09989v1 Announce Type: new Abstract: AI power demand is growing at an unprecedented rate thanks to the high power density of AI compute and the emerging inferencing workload. On the supply side, abundant wind power is waiting for grid access in interconnection queues. In this light, this paper argues for bringing AI workload to modular compute clusters co-located in wind farms. Our deployment right-sizing strategy makes it economically viable to deploy more than 6 million high-end GPUs today that could consume cheap, green power at its source. We built Heron, a cross-site software router that could efficiently leverage the complementarity of power generation across wind farms by routing AI inferencing workload around power drops. Using one week of coding and conversation production traces from Azure and (real) variable wind power traces, we show how Heron improves aggregate goodput of AI compute by up to 80% compared to the state-of-the-art.

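Summary

Because of the high power density of AI compute and the emerging inference workload, AI power demand is growing at an unprecedented rate. On the supply side, large amounts of wind power are waiting in interconnection queues for grid access. On this basis, the paper proposes deploying AI workloads on modular compute clusters located in wind farms. Our deployment right-sizing strategy makes it economically viable to deploy more than 6 million high-end GPUs today, which could directly consume cheap, green power at its source. We built Heron, a cross-site software router that exploits the complementarity of power generation across wind farms by dynamically scheduling AI inference tasks in response to power fluctuations. Using one week of coding and conversation production traces from Azure together with (real) variable wind power data, we show that Heron improves the aggregate goodput of AI compute by up to 80% compared with the state of the art.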
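
The idea of routing inference load around power drops can be illustrated with a toy example. The sketch below is not Heron's algorithm, which the abstract does not describe; it is a minimal greedy placement under assumed inputs. The function name `route`, the site names, and the capacity numbers are all hypothetical.

```python
def route(demand, capacity):
    """Greedily place `demand` on sites, largest available capacity first.

    `capacity` maps a site name to the load it can serve right now
    (hypothetical units, e.g. requests per second that the current
    wind power can sustain). Returns (assignment, unserved_load).
    """
    assignment = {site: 0.0 for site in capacity}
    remaining = float(demand)
    for site in sorted(capacity, key=capacity.get, reverse=True):
        share = min(remaining, capacity[site])
        assignment[site] = share
        remaining -= share
    return assignment, remaining


# Hypothetical snapshot: both farms generate well.
steady = {"farm_a": 60.0, "farm_b": 50.0}
# Hypothetical snapshot: farm_a's output drops while farm_b's rises.
after_drop = {"farm_a": 10.0, "farm_b": 70.0}

plan_steady, unserved_steady = route(80, steady)
plan_drop, unserved_drop = route(80, after_drop)
```

In this toy case the same demand stays fully served after the drop because the second site's generation is complementary, which is the property the abstract says the router leverages.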


Pre-Act: Multi-Step Planning and Reasoning Improves Acting in LLM Agents

Abstract

arXiv:2505.09970v1 Announce Type: new Abstract: The ReAct (Reasoning + Action) capability in large language models (LLMs) has become the foundation of modern agentic systems. Recent LLMs, such as DeepSeek-R1 and OpenAI o1/o3, exemplify this by emphasizing reasoning through the generation of ample intermediate tokens, which help build a strong premise before producing the final output tokens. In this paper, we introduce Pre-Act, a novel approach that enhances the agent's performance by creating a multi-step execution plan along with the detailed reasoning for the given user input. This plan incrementally incorporates previous steps and tool outputs, refining itself after each step execution until the final response is obtained. Our approach is applicable to both conversational and non-conversational agents. To measure the performance of task-oriented agents comprehensively, we propose a two-level evaluation framework: (1) turn level and (2) end-to-end. Our turn-level evaluation, averaged across five models, shows that our approach, Pre-Act, outperforms ReAct by 70% in Action Recall on the Almita dataset. While this approach is effective for larger models, smaller models crucial for practical applications, where latency and cost are key constraints, often struggle with complex reasoning tasks required for agentic systems. To address this limitation, we fine-tune relatively small models such as Llama 3.1 (8B & 70B) using the proposed Pre-Act approach. Our experiments show that the fine-tuned 70B model outperforms GPT-4, achieving a 69.5% improvement in action accuracy (turn-level) and a 28% improvement in goal completion rate (end-to-end) on the Almita (out-of-domain) dataset.

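Summary

The ReAct (Reasoning + Action) capability of large language models (LLMs) has become the foundation of modern agentic systems. Recent models such as DeepSeek-R1 and OpenAI o1/o3 reinforce this by generating large numbers of intermediate reasoning tokens, which build a solid premise before the final output is produced. This paper proposes Pre-Act, an approach that improves agent performance by creating a multi-step execution plan, with detailed reasoning, for a given user input. The plan incrementally incorporates previous steps and tool outputs, and refines itself after each step is executed until the final response is obtained. The approach applies to both conversational and non-conversational agents. To evaluate task-oriented agents comprehensively, we propose a two-level evaluation framework: (1) turn level and (2) end-to-end. In turn-level evaluation on the Almita dataset, averaged across five models, Pre-Act improves Action Recall by 70% over ReAct. Although the approach works well for larger models, the smaller models that matter in practice, where latency and cost are constraints, often struggle with the complex reasoning that agentic systems require. We therefore fine-tune relatively small models such as Llama 3.1 (8B & 70B) with the Pre-Act approach. Experiments show that the fine-tuned 70B model outperforms GPT-4 on the Almita (out-of-domain) dataset, with a 69.5% improvement in action accuracy (turn level) and a 28% improvement in goal completion rate (end-to-end).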
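
The plan-then-refine loop described above can be sketched as follows. This is only an illustration of the control flow, assuming a planner that returns a list of steps and is called again after every tool output; the step format, `toy_planner`, and the `lookup_order` tool are hypothetical stand-ins for an LLM and real tools, not the paper's implementation.

```python
def pre_act(user_input, planner, tools, max_steps=10):
    """Run a plan-execute-refine loop.

    `planner(user_input, history)` returns a list of steps; each step is a
    dict with a "tool" name, an "arg", and a "why" (the reasoning). The plan
    is rebuilt after every executed step so it can use earlier tool outputs.
    """
    history = []
    plan = planner(user_input, history)
    while plan and len(history) < max_steps:
        step = plan[0]
        if step["tool"] == "respond":
            return step["arg"], history
        output = tools[step["tool"]](step["arg"])
        history.append((step, output))
        plan = planner(user_input, history)  # refine using the new output
    return None, history


def toy_planner(user_input, history):
    """Hypothetical planner; a real system would prompt an LLM here."""
    if not history:
        return [
            {"tool": "lookup_order", "arg": user_input, "why": "need the order status"},
            {"tool": "respond", "arg": "(to be filled in)", "why": "answer the user"},
        ]
    _, status = history[-1]
    return [{"tool": "respond", "arg": f"Your order is {status}.", "why": "status known"}]


tools = {"lookup_order": lambda order_id: "shipped"}
answer, trace = pre_act("A17", toy_planner, tools)
```

The initial plan already contains the final `respond` step, but its content is only filled in once the planner sees the tool output, which mirrors the "refining itself after each step" behaviour in the abstract.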


ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production

Abstract

arXiv:2505.09999v1 Announce Type: new Abstract: With the widespread adoption of Large Language Models (LLMs), serving LLM inference requests has become an increasingly important task, attracting active research advancements. Practical workloads play an essential role in this process: they are critical for motivating and benchmarking serving techniques and systems. However, the existing understanding of real-world LLM serving workloads is limited due to the lack of a comprehensive workload characterization. Prior analyses remain insufficient in scale and scope, thus failing to fully capture intricate workload characteristics. In this paper, we fill the gap with an in-depth characterization of LLM serving workloads collected from our worldwide cloud inference serving service, covering not only language models but also emerging multimodal and reasoning models, and unveiling important new findings in each case. Moreover, based on our findings, we propose ServeGen, a principled framework for generating realistic LLM serving workloads by composing them on a per-client basis. A practical use case in production validates that ServeGen avoids 50% under-provisioning compared to naive workload generation, demonstrating ServeGen's advantage in performance benchmarking. We will open-source ServeGen to foster future research.

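Summary

With the widespread adoption of large language models (LLMs), serving LLM inference requests has become an increasingly important task and has driven active research. Real workloads play a key role in this process: they are essential for motivating and benchmarking serving techniques and systems. However, because comprehensive workload characterization is lacking, the current understanding of real-world LLM serving workloads remains limited. Prior analyses fall short in both scale and scope, and so fail to capture intricate workload characteristics.

This paper fills the gap with an in-depth analysis of LLM serving workloads collected from a worldwide cloud inference service. The study covers not only language models but also emerging multimodal and reasoning models, and reports important new findings in each case. Based on these findings, we propose ServeGen, a principled framework that generates realistic LLM serving workloads by composing them on a per-client basis. A production use case shows that ServeGen avoids 50% under-provisioning compared with naive workload generation, demonstrating its advantage for performance benchmarking. We will open-source ServeGen to foster future research.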
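
To make "composing workloads on a per-client basis" concrete, here is a minimal sketch that generates one arrival stream per client and merges them by timestamp. It assumes each client sends requests as an independent Poisson process; that distribution, the client names, and the rates are illustrative assumptions, not ServeGen's actual model, which the abstract does not specify.

```python
import heapq
import random


def client_arrivals(client_id, rate, horizon, rng):
    """Return sorted (timestamp, client_id) events for one client.

    Inter-arrival gaps are exponential with mean 1/rate, i.e. a Poisson
    process (an assumption made for this sketch).
    """
    events = []
    t = rng.expovariate(rate)
    while t < horizon:
        events.append((t, client_id))
        t += rng.expovariate(rate)
    return events


def compose_workload(client_rates, horizon, seed=0):
    """Merge the per-client streams into one time-ordered request trace."""
    rng = random.Random(seed)
    streams = [
        client_arrivals(client, rate, horizon, rng)
        for client, rate in client_rates.items()
    ]
    return list(heapq.merge(*streams))


# Hypothetical clients: one heavy sender and two light ones (requests/second).
rates = {"client_heavy": 5.0, "client_light_1": 0.5, "client_light_2": 0.2}
trace = compose_workload(rates, horizon=60.0, seed=42)
```

Building the trace from per-client streams keeps the skew between heavy and light clients visible, whereas sampling a single aggregate rate would hide it.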


From Text to Network: Constructing a Knowledge Graph of Taiwan-Based China Studies Using Generative AI

Abstract

arXiv:2505.10093v1 Announce Type: new Abstract: Taiwanese China Studies (CS) has developed into a rich, interdisciplinary research field shaped by the unique geopolitical position and long-standing academic engagement with Mainland China. This study responds to the growing need to systematically revisit and reorganize decades of Taiwan-based CS scholarship by proposing an AI-assisted approach that transforms unstructured academic texts into structured, interactive knowledge representations. We apply generative AI (GAI) techniques and large language models (LLMs) to extract and standardize entity-relation triples from 1,367 peer-reviewed CS articles published between 1996 and 2019. These triples are then visualized through a lightweight D3.js-based system, forming the foundation of a domain-specific knowledge graph and vector database for the field. This infrastructure allows users to explore conceptual nodes and semantic relationships across the corpus, revealing previously uncharted intellectual trajectories, thematic clusters, and research gaps. By decomposing textual content into graph-structured knowledge units, our system enables a paradigm shift from linear text consumption to network-based knowledge navigation. In doing so, it enhances scholarly access to CS literature while offering a scalable, data-driven alternative to traditional ontology construction. This work not only demonstrates how generative AI can augment area studies and digital humanities but also highlights its potential to support a reimagined scholarly infrastructure for regional knowledge systems.

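Summary

China Studies (CS) in Taiwan has developed into a rich and diverse interdisciplinary research field, shaped by Taiwan's unique geopolitical position and its long-standing academic exchange with the mainland. In response to the pressing need to systematically revisit and consolidate decades of Taiwan-based CS scholarship, this study proposes an AI-assisted approach that turns unstructured academic texts into structured, interactive knowledge representations. We use generative AI (GAI) techniques and large language models (LLMs) to extract and standardize entity-relation triples from 1,367 peer-reviewed CS articles published between 1996 and 2019, and then visualize them through a lightweight D3.js-based system, laying the foundation for a domain-specific knowledge graph and vector database. This infrastructure lets users explore conceptual nodes and semantic relationships across the corpus, revealing previously undiscovered intellectual trajectories, thematic clusters, and research gaps. By decomposing textual content into graph-structured knowledge units, the system enables a paradigm shift from linear text consumption to network-based knowledge navigation. It improves scholarly access to CS literature and offers a scalable, data-driven alternative to traditional ontology construction. The study shows how generative AI can augment area studies and digital humanities, and highlights its potential to support a reimagined scholarly infrastructure for regional knowledge systems.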


MASS: Multi-Agent Simulation Scaling for Portfolio Construction

Abstract

arXiv:2505.10278v1 Announce Type: new Abstract: LLM-based multi-agent systems have gained significant attention for their potential in simulation and enhancing performance. However, existing works are limited to pure simulations or are constrained by predefined workflows, restricting their applicability and effectiveness. In this paper, we introduce the Multi-Agent Scaling Simulation (MASS) for portfolio construction. MASS achieves stable and continuous excess returns by progressively increasing the number of agents for large-scale simulations to gain a superior understanding of the market and optimizing agent distribution end-to-end through a reverse optimization process, rather than relying on a fixed workflow. We demonstrate its superiority through performance experiments, ablation studies, backtesting experiments, experiments on updated data and stock pools, scaling experiments, parameter sensitivity experiments, and visualization experiments, conducted in comparison with 6 state-of-the-art baselines on 3 challenging A-share stock pools. We expect the paradigm established by MASS to expand to other tasks with similar characteristics. The implementation of MASS has been open-sourced at https://github.com/gta0804/MASS.

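Summary

LLM-based multi-agent systems have attracted wide attention for their potential in simulation and in improving performance. However, existing work is limited to pure simulation or constrained by predefined workflows, which restricts its applicability and effectiveness. This paper proposes Multi-Agent Scaling Simulation (MASS) for portfolio construction. MASS progressively increases the number of agents to run large-scale simulations and gain a deeper understanding of the market, and optimizes the agent distribution end-to-end through a reverse optimization process rather than relying on a fixed workflow, thereby achieving stable and continuous excess returns. On 3 challenging A-share stock pools, and in comparison with 6 state-of-the-art baselines, we validate its superiority through performance experiments, ablation studies, backtesting experiments, experiments on updated data and stock pools, scaling experiments, parameter sensitivity experiments, and visualization experiments. We expect the paradigm established by MASS to extend to other tasks with similar characteristics. The implementation of MASS is open-sourced at https://github.com/gta0804/MASS.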


Leveraging Graph Retrieval-Augmented Generation to Support Learners' Understanding of Knowledge Concepts in MOOCs

Abstract

arXiv:2505.10074v1 Announce Type: new Abstract: Massive Open Online Courses (MOOCs) lack direct interaction between learners and instructors, making it challenging for learners to understand new knowledge concepts. Recently, learners have increasingly used Large Language Models (LLMs) to support them in acquiring new knowledge. However, LLMs are prone to hallucinations, which limits their reliability. Retrieval-Augmented Generation (RAG) addresses this issue by retrieving relevant documents before generating a response. However, the application of RAG across different MOOCs is limited by unstructured learning material. Furthermore, current RAG systems do not actively guide learners toward their learning needs. To address these challenges, we propose a Graph RAG pipeline that leverages Educational Knowledge Graphs (EduKGs) and Personal Knowledge Graphs (PKGs) to guide learners to understand knowledge concepts in the MOOC platform CourseMapper. Specifically, we implement (1) a PKG-based Question Generation method to recommend personalized questions for learners in context, and (2) an EduKG-based Question Answering method that leverages the relationships between knowledge concepts in the EduKG to answer learner-selected questions. To evaluate both methods, we conducted a study with 3 expert instructors on 3 different MOOCs in the MOOC platform CourseMapper. The results of the evaluation show the potential of Graph RAG to empower learners to understand new knowledge concepts in a personalized learning experience.

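Summary

Massive Open Online Courses (MOOCs) lack direct interaction between learners and instructors, which makes it hard for learners to understand new knowledge concepts. Learners have recently turned more and more to large language models (LLMs) to help them acquire new knowledge. However, LLMs are prone to hallucination, which limits their reliability. Retrieval-Augmented Generation (RAG) addresses this by retrieving relevant documents before generating a response. However, unstructured learning material limits the application of RAG across different MOOCs, and current RAG systems do not actively guide learners toward their learning needs. To address these challenges, we propose a Graph RAG pipeline that uses Educational Knowledge Graphs (EduKGs) and Personal Knowledge Graphs (PKGs) to guide learners in understanding knowledge concepts on the MOOC platform CourseMapper. Specifically, we implement (1) a PKG-based question generation method that recommends personalized, context-relevant questions to learners, and (2) an EduKG-based question answering method that uses the relationships between knowledge concepts in the EduKG to answer the questions learners select. To evaluate both methods, we conducted a study with 3 expert instructors on 3 different MOOCs on CourseMapper. The evaluation results show the potential of Graph RAG to help learners understand new knowledge concepts through a personalized learning experience.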


Empirically evaluating commonsense intelligence in large language models with large-scale human judgments

Abstract

arXiv:2505.10309v1 Announce Type: new Abstract: Commonsense intelligence in machines is often assessed by static benchmarks that compare a model's output against human-prescribed correct labels. An important, albeit implicit, assumption of these labels is that they accurately capture what any human would think, effectively treating human common sense as homogeneous. However, recent empirical work has shown that humans vary enormously in what they consider commonsensical; thus what appears self-evident to one benchmark designer may not be so to another. Here, we propose a novel method for evaluating common sense in artificial intelligence (AI), specifically in large language models (LLMs), that incorporates empirically observed heterogeneity among humans by measuring the correspondence between a model's judgment and that of a human population. We first find that, when treated as independent survey respondents, most LLMs remain below the human median in their individual commonsense competence. Second, when used as simulators of a hypothetical population, LLMs correlate with real humans only modestly in the extent to which they agree on the same set of statements. In both cases, smaller, open-weight models are surprisingly more competitive than larger, proprietary frontier models. Our evaluation framework, which ties commonsense intelligence to its cultural basis, contributes to the growing call for adapting AI models to human collectivities that possess different, often incompatible, social stocks of knowledge.

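Summary

Commonsense intelligence in machines is usually assessed with static benchmarks that compare a model's output against correct labels prescribed by humans. These labels carry an important implicit assumption: that they accurately reflect what every human would think, which in effect treats human common sense as homogeneous. Recent empirical work, however, shows that people differ enormously in what they consider common sense, so a conclusion that one benchmark designer finds self-evident may not be so for others. We therefore propose a new method for evaluating common sense in artificial intelligence, specifically in large language models (LLMs), that accounts for the empirically observed heterogeneity among humans by measuring the correspondence between a model's judgments and those of a human population. We find, first, that when treated as independent survey respondents, most LLMs remain below the human median in individual commonsense competence. Second, when used to simulate a hypothetical population, LLMs correlate only modestly with real humans in how much they agree on the same statements. Notably, in both cases smaller open-weight models are more competitive than larger proprietary frontier models. Our evaluation framework ties commonsense intelligence to its cultural basis and responds to the growing call to adapt AI models to human collectivities that hold different, often incompatible, social stocks of knowledge.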


Towards a Deeper Understanding of Reasoning Capabilities in Large Language Models

Abstract

arXiv:2505.10543v1 Announce Type: new Abstract: While large language models demonstrate impressive performance on static benchmarks, the true potential of large language models as self-learning and reasoning agents in dynamic environments remains unclear. This study systematically evaluates the efficacy of self-reflection, heuristic mutation, and planning as prompting techniques to test the adaptive capabilities of agents. We conduct experiments with various open-source language models in dynamic environments and find, first, that larger models generally outperform smaller ones, but that strategic prompting can close this performance gap. Second, an overly long prompt can negatively impact smaller models on basic reactive tasks, while larger models show more robust behaviour. Third, advanced prompting techniques primarily benefit smaller models on complex games, but offer less improvement for already high-performing large language models. Yet, we find that advanced reasoning methods yield highly variable outcomes: while capable of significantly improving performance when reasoning and decision-making align, they also introduce instability and can lead to big performance drops. Compared to human performance, our findings reveal little evidence of true emergent reasoning. Instead, large language model performance exhibits persistent limitations in crucial areas such as planning, reasoning, and spatial coordination, suggesting that current-generation large language models still suffer fundamental shortcomings that may not be fully overcome through self-reflective prompting alone. Reasoning is a multi-faceted task, and while reasoning methods like chain of thought improve multi-step reasoning on math word problems, our findings using dynamic benchmarks highlight important shortcomings in general reasoning capabilities, indicating a need to move beyond static benchmarks to capture the complexity of reasoning.

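Summary

Although large language models perform impressively on static benchmarks, their true potential as self-learning and reasoning agents in dynamic environments remains unclear. This study systematically evaluates how three prompting techniques (self-reflection, heuristic mutation, and planning) affect the adaptive capabilities of agents. In experiments with a range of open-source language models in dynamic environments, we find, first, that larger models generally outperform smaller ones, but strategic prompting can narrow this gap. Second, overly long prompts hurt smaller models on basic reactive tasks, while larger models are more robust. Third, advanced prompting techniques mainly help smaller models on complex games and offer limited improvement for large models that already perform well. However, advanced reasoning methods produce highly variable results: they can significantly improve performance when reasoning and decision-making align, but they can also introduce instability and cause large performance drops. Compared with human performance, the results show almost no evidence of true emergent reasoning. Current large language models still have persistent limitations in key areas such as planning, reasoning, and spatial coordination, which suggests that the fundamental shortcomings of this generation of models may not be fully overcome through self-reflective prompting alone. Reasoning is a multi-faceted task: methods such as chain of thought improve multi-step reasoning on math word problems, but our dynamic benchmarks reveal important shortcomings in general reasoning ability, indicating a need to move beyond static benchmarks to capture the complexity of reasoning.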


An AI-Powered Research Assistant in the Lab: A Practical Guide for Text Analysis Through Iterative Collaboration with LLMs

Abstract

arXiv:2505.09724v1 Announce Type: cross Abstract: Analyzing texts such as open-ended responses, headlines, or social media posts is a time- and labor-intensive process highly susceptible to bias. LLMs are promising tools for text analysis, using either a predefined (top-down) or a data-driven (bottom-up) taxonomy, without sacrificing quality. Here we present a step-by-step tutorial to efficiently develop, test, and apply taxonomies for analyzing unstructured data through an iterative and collaborative process between researchers and LLMs. Using personal goals provided by participants as an example, we demonstrate how to write prompts to review datasets and generate a taxonomy of life domains, evaluate and refine the taxonomy through prompt and direct modifications, test the taxonomy and assess intercoder agreements, and apply the taxonomy to categorize an entire dataset with high intercoder reliability. We discuss the possibilities and limitations of using LLMs for text analysis.

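Summary

Analyzing texts such as open-ended responses, news headlines, or social media posts is a time-consuming, labor-intensive process that is highly susceptible to bias. Large language models (LLMs) are powerful tools for text analysis and can work with either a predefined (top-down) or a data-driven (bottom-up) taxonomy without sacrificing quality. This paper walks step by step through how to efficiently develop, test, and apply taxonomies for analyzing unstructured data using an iterative, collaborative process between researchers and LLMs. Using personal goals provided by participants as an example, we show how to write prompts that review a dataset and generate a taxonomy of life domains, how to evaluate and refine the taxonomy through prompt adjustments and direct modifications, how to test the taxonomy and assess intercoder agreement, and finally how to apply the taxonomy to categorize an entire dataset with high intercoder reliability. We also discuss the possibilities and limitations of using LLMs for text analysis.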


System Prompt Optimization with Meta-Learning

Abstract

arXiv:2505.09666v1 Announce Type: cross Abstract: Large Language Models (LLMs) have shown remarkable capabilities, with optimizing their input prompts playing a pivotal role in maximizing their performance. However, while LLM prompts consist of both the task-agnostic system prompts and task-specific user prompts, existing work on prompt optimization has focused on user prompts specific to individual queries or tasks, and largely overlooked the system prompt that is, once optimized, applicable across different tasks and domains. Motivated by this, we introduce the novel problem of bilevel system prompt optimization, whose objective is to design system prompts that are robust to diverse user prompts and transferable to unseen tasks. To tackle this problem, we then propose a meta-learning framework, which meta-learns the system prompt by optimizing it over various user prompts across multiple datasets, while simultaneously updating the user prompts in an iterative manner to ensure synergy between them. We conduct experiments on 14 unseen datasets spanning 5 different domains, on which we show that our approach produces system prompts that generalize effectively to diverse user prompts. Also, our findings reveal that the optimized system prompt enables rapid adaptation even to unseen tasks, requiring fewer optimization steps for test-time user prompts while achieving improved performance.

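Summary

Large language models (LLMs) have shown remarkable capabilities, and optimizing their input prompts plays a key role in maximizing performance. However, although LLM prompts consist of both task-agnostic system prompts and task-specific user prompts, existing work on prompt optimization has focused on user prompts for individual queries or tasks and has largely overlooked the system prompt, which, once optimized, applies across different tasks and domains. Motivated by this, we introduce the new problem of bilevel system prompt optimization, whose goal is to design system prompts that are robust to diverse user prompts and transferable to unseen tasks. To tackle it, we propose a meta-learning framework that meta-learns the system prompt by optimizing it over varied user prompts across multiple datasets, while iteratively updating the user prompts to keep the two in synergy. We run experiments on 14 unseen datasets spanning 5 different domains and show that the approach produces system prompts that generalize effectively to diverse user prompts. We also find that the optimized system prompt enables rapid adaptation even to unseen tasks: test-time user prompts need fewer optimization steps while achieving better performance.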


Exploring the generalization of LLM truth directions on conversational formats

Abstract

arXiv:2505.09807v1 Announce Type: cross Abstract: Several recent works argue that LLMs have a universal truth direction where true and false statements are linearly separable in the activation space of the model. It has been demonstrated that linear probes trained on a single hidden state of the model already generalize across a range of topics and might even be used for lie detection in LLM conversations. In this work we explore how this truth direction generalizes between various conversational formats. We find good generalization between short conversations that end on a lie, but poor generalization to longer formats where the lie appears earlier in the input prompt. We propose a solution that significantly improves this type of generalization by adding a fixed key phrase at the end of each conversation. Our results highlight the challenges towards reliable LLM lie detectors that generalize to new settings.

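Summary

Several recent works argue that large language models (LLMs) have a universal truth direction, meaning that true and false statements are linearly separable in the model's activation space. It has been shown that linear probes trained on a single hidden state of the model already generalize across a range of topics and might even be used to detect lies in LLM conversations. This work explores how well this truth direction generalizes between different conversational formats. We find good generalization between short conversations that end on a lie, but poor generalization to longer formats in which the lie appears earlier in the input prompt. We propose a solution that significantly improves this kind of generalization by adding a fixed key phrase at the end of each conversation. The results highlight the challenges of building reliable LLM lie detectors that generalize to new settings.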
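
A "truth direction" is a vector in activation space along which true and false statements separate. The sketch below shows the simplest possible linear probe, the difference between the two class means, fitted on synthetic vectors. Everything here is an assumption for illustration: real work would use hidden states extracted from an LLM, the papers discussed may train probes differently (for example with logistic regression), and the Gaussian data, dimension, and shift size are made up.

```python
import random


def make_activations(n, dim, shift, rng):
    """Synthetic stand-in for hidden states: Gaussian noise plus a shift on axis 0."""
    return [
        [rng.gauss(0.0, 1.0) + (shift if d == 0 else 0.0) for d in range(dim)]
        for _ in range(n)
    ]


def mean_vector(rows):
    return [sum(column) / len(rows) for column in zip(*rows)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fit_probe(true_acts, false_acts):
    """Difference-of-means probe: direction w and a bias at the class midpoint."""
    mu_true, mu_false = mean_vector(true_acts), mean_vector(false_acts)
    w = [t - f for t, f in zip(mu_true, mu_false)]
    midpoint = [(t + f) / 2 for t, f in zip(mu_true, mu_false)]
    return w, -dot(w, midpoint)


def predict_true(w, bias, activation):
    return dot(w, activation) + bias > 0


rng = random.Random(0)
w, bias = fit_probe(make_activations(200, 8, +3.0, rng), make_activations(200, 8, -3.0, rng))

test_true = make_activations(200, 8, +3.0, rng)
test_false = make_activations(200, 8, -3.0, rng)
correct = sum(predict_true(w, bias, x) for x in test_true)
correct += sum(not predict_true(w, bias, x) for x in test_false)
accuracy = correct / 400
```

The generalization question raised in the abstract corresponds to fitting `w` on one conversational format and measuring accuracy on another; the synthetic data here uses a single distribution, so it says nothing about that harder setting.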


Trustless Autonomy: Understanding Motivations, Benefits and Governance Dilemma in Self-Sovereign Decentralized AI Agents

Abstract

arXiv:2505.09757v1 Announce Type: cross Abstract: The recent trend of self-sovereign Decentralized AI Agents (DeAgents) combines Large Language Model (LLM)-based AI agents with decentralization technologies such as blockchain smart contracts and trusted execution environments (TEEs). These tamper-resistant trustless substrates allow agents to achieve self-sovereignty through ownership of cryptowallet private keys and control of digital assets and social media accounts. DeAgents eliminate centralized control and reduce human intervention, addressing key trust concerns inherent in centralized AI systems. However, given ongoing challenges in LLM reliability such as hallucinations, this creates a paradoxical tension between trustlessness and unreliable autonomy. This study addresses this empirical research gap through interviews with DeAgents stakeholders (experts, founders, and developers) to examine their motivations, benefits, and governance dilemmas. The findings will guide future DeAgents system and protocol design and inform discussions about governance in sociotechnical AI systems in the future agentic web.

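Summary

The recent trend of self-sovereign Decentralized AI Agents (DeAgents) combines AI agents based on large language models (LLMs) with decentralization technologies such as blockchain smart contracts and trusted execution environments (TEEs). These tamper-resistant, trustless substrates let agents achieve self-sovereignty by holding cryptowallet private keys and controlling digital assets and social media accounts. DeAgents eliminate centralized control and reduce human intervention, addressing key trust concerns inherent in centralized AI systems. However, given the persistent reliability challenges of LLMs, such as hallucination, this creates a paradoxical tension between trustlessness and unreliable autonomy. Through interviews with DeAgents stakeholders (experts, founders, and developers), this study empirically examines their motivations, benefits, and governance dilemmas in order to fill this research gap. The findings will guide future DeAgents system and protocol design and inform discussion of governance in sociotechnical AI systems in the future agentic web.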


Evaluating Large Language Models for the Generation of Unit Tests with Equivalence Partitions and Boundary Values

Abstract

arXiv:2505.09830v1 Announce Type: cross Abstract: The design and implementation of unit tests is a complex task many programmers neglect. This research evaluates the potential of Large Language Models (LLMs) in automatically generating test cases, comparing them with manual tests. An optimized prompt was developed that integrates code and requirements, covering critical cases such as equivalence partitions and boundary values. The strengths and weaknesses of LLMs versus trained programmers were compared through quantitative metrics and manual qualitative analysis. The results show that the effectiveness of LLMs depends on well-designed prompts, robust implementation, and precise requirements. Although flexible and promising, LLMs still require human supervision. This work highlights the importance of manual qualitative analysis as an essential complement to automation in unit test evaluation.

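Summary

Designing and implementing unit tests is a complex task that many programmers neglect. This research evaluates the potential of large language models (LLMs) to generate test cases automatically and compares them with manual tests. An optimized prompt that integrates code and requirements was developed, covering critical test cases such as equivalence partitions and boundary values. The strengths and weaknesses of LLMs relative to trained programmers were compared using quantitative metrics combined with manual qualitative analysis. The results show that the effectiveness of LLMs depends on well-designed prompts, robust implementation, and precise requirements. Although LLMs are flexible and promising, they still require human supervision. The study underlines the importance of manual qualitative analysis as an essential complement to automation in unit test evaluation.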
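
For readers unfamiliar with the two test-design techniques named above, here is a small worked example. The function under test, `classify_age`, and its age ranges are hypothetical and not taken from the paper. Equivalence partitioning picks one representative input per range that should behave the same way; boundary value analysis adds the values at and next to each range edge, where off-by-one bugs tend to hide.

```python
def classify_age(age):
    """Hypothetical function under test: maps an age in years to a category."""
    if age < 0 or age > 120:
        raise ValueError("age out of range")
    if age < 18:
        return "minor"
    if age < 65:
        return "adult"
    return "senior"


# Valid equivalence partitions as inclusive (low, high) ranges.
PARTITIONS = {"minor": (0, 17), "adult": (18, 64), "senior": (65, 120)}


def representative(low, high):
    """One typical value from the middle of a partition."""
    return (low + high) // 2


def boundary_values(low, high):
    """Values on and just inside each edge of a partition."""
    return [low, low + 1, high - 1, high]


def build_cases():
    """Return (input, expected_label) pairs covering every partition."""
    cases = []
    for label, (low, high) in PARTITIONS.items():
        for value in [representative(low, high)] + boundary_values(low, high):
            cases.append((value, label))
    return cases


def run_cases():
    """Check valid cases, then the invalid partitions just outside the range."""
    for value, expected in build_cases():
        assert classify_age(value) == expected, (value, expected)
    for invalid in (-1, 121):
        try:
            classify_age(invalid)
        except ValueError:
            continue
        raise AssertionError(f"{invalid} should be rejected")
    return len(build_cases())
```

Calling `run_cases()` exercises 15 valid inputs plus the two invalid neighbours; a prompt that asks an LLM for tests "covering equivalence partitions and boundary values" is asking for a case list of this shape.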


Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning

Abstract

arXiv:2505.09738v1 Announce Type: cross Abstract: Pretrained language models (LLMs) are often constrained by their fixed tokenization schemes, leading to inefficiencies and performance limitations, particularly for multilingual or specialized applications. This tokenizer lock-in presents significant challenges. Standard methods to overcome this often require prohibitive computational resources. Although tokenizer replacement with heuristic initialization aims to reduce this burden, existing methods often require exhaustive residual fine-tuning and still may not fully preserve semantic nuances or adequately address the underlying compression inefficiencies. Our framework introduces two innovations: first, TokenAdapt, a model-agnostic tokenizer transplantation method, and second, novel pre-tokenization learning for multi-word Supertokens to enhance compression and reduce fragmentation. TokenAdapt initializes new unique token embeddings via a hybrid heuristic that combines two methods: a local estimate based on subword decomposition using the old tokenizer, and a global estimate utilizing the top-k semantically similar tokens from the original vocabulary. This methodology aims to preserve semantics while significantly minimizing retraining requirements. Empirical investigations validate both contributions: the transplantation heuristic successfully initializes unique tokens, markedly outperforming conventional baselines and sophisticated methods including TransTokenizer and ReTok, while our Supertokens achieve notable compression gains. Our zero-shot perplexity results demonstrate that the TokenAdapt hybrid initialization consistently yields lower perplexity ratios compared to both ReTok and TransTokenizer baselines across different base models and newly trained target tokenizers. TokenAdapt typically reduced the overall perplexity ratio significantly compared to ReTok, yielding at least a 2-fold improvement in these aggregate scores.

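Summary

Pretrained language models (LLMs) are often constrained by their fixed tokenization schemes, which leads to inefficiency and performance limits, especially in multilingual or specialized applications. This tokenizer lock-in poses significant challenges, and standard ways of overcoming it usually require very large computational resources. Although replacing the tokenizer with heuristic initialization aims to reduce this burden, existing methods often need extensive residual fine-tuning and may still fail to fully preserve semantic nuances or to address the underlying compression inefficiency. We propose a framework with two innovations: first, TokenAdapt, a model-agnostic tokenizer transplantation method; second, a new pre-tokenization learning mechanism for multi-word Supertokens that improves compression and reduces fragmentation. TokenAdapt initializes the embeddings of new unique tokens with a hybrid heuristic that combines two methods: a local estimate based on subword decomposition with the old tokenizer, and a global estimate that uses the top-k semantically similar tokens from the original vocabulary. The method aims to preserve semantics while significantly reducing retraining requirements. Empirical studies validate both contributions: the transplantation heuristic successfully initializes unique tokens and markedly outperforms conventional baselines as well as sophisticated methods including TransTokenizer and ReTok, while Supertokens achieve notable compression gains. Zero-shot perplexity results show that, across different base models and newly trained target tokenizers, TokenAdapt's hybrid initialization consistently yields lower perplexity ratios than the ReTok and TransTokenizer baselines. Compared with ReTok, TokenAdapt typically reduces the overall perplexity ratio substantially, an improvement of at least 2-fold.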
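
The hybrid heuristic described above (a local estimate from subword decomposition plus a global estimate from the top-k similar tokens) can be sketched on toy vectors. The abstract does not give the weighting scheme, so several choices here are assumptions: the local estimate is an unweighted mean, the global estimate is a similarity-weighted mean, similarity is cosine in a separate auxiliary embedding space, and the two are blended with a fixed `w_global`. All names and numbers are illustrative.

```python
import math


def cosine(a, b):
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def weighted_mean(vectors, weights):
    total = sum(weights)
    return [
        sum(w * v[d] for v, w in zip(vectors, weights)) / total
        for d in range(len(vectors[0]))
    ]


def local_estimate(pieces, old_emb):
    """Average the old embeddings of the pieces the OLD tokenizer splits the new token into."""
    return weighted_mean([old_emb[p] for p in pieces], [1.0] * len(pieces))


def global_estimate(aux_vec, aux_vocab, old_emb, k):
    """Similarity-weighted mean of the old embeddings of the k nearest old tokens.

    Nearness is measured in an auxiliary semantic space (assumed here).
    """
    nearest = sorted(aux_vocab, key=lambda t: cosine(aux_vec, aux_vocab[t]), reverse=True)[:k]
    sims = [max(cosine(aux_vec, aux_vocab[t]), 1e-8) for t in nearest]
    return weighted_mean([old_emb[t] for t in nearest], sims)


def hybrid_init(pieces, aux_vec, aux_vocab, old_emb, k=2, w_global=0.5):
    """Blend the two estimates; `w_global` is a hypothetical mixing weight."""
    loc = local_estimate(pieces, old_emb)
    glo = global_estimate(aux_vec, aux_vocab, old_emb, k)
    return [(1 - w_global) * l + w_global * g for l, g in zip(loc, glo)]


# Toy data: model embeddings of old tokens, and their auxiliary semantic vectors.
old_emb = {"token": [1.0, 0.0], "izer": [0.0, 1.0], "parser": [1.0, 1.0]}
aux_vocab = {"token": [1.0, 0.2], "izer": [0.1, 1.0], "parser": [1.0, 0.9]}

# New token "tokenizer": the old tokenizer splits it into "token" + "izer".
new_embedding = hybrid_init(["token", "izer"], [1.0, 1.0], aux_vocab, old_emb, k=1)
```

With `k=1` the nearest old token is `parser`, so the global estimate is `[1.0, 1.0]`, the local estimate is `[0.5, 0.5]`, and the blend lands halfway between them.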


Contextual Phenotyping of Pediatric Sepsis Cohort Using Large Language Models

Abstract

arXiv:2505.09805v1 Announce Type: cross Abstract: Clustering patient subgroups is essential for personalized care and efficient resource use. Traditional clustering methods struggle with high-dimensional, heterogeneous healthcare data and lack contextual understanding. This study evaluates Large Language Model (LLM) based clustering against classical methods using a pediatric sepsis dataset from a low-income country (LIC), containing 2,686 records with 28 numerical and 119 categorical variables. Patient records were serialized into text with and without a clustering objective. Embeddings were generated using quantized LLAMA 3.1 8B, DeepSeek-R1-Distill-Llama-8B with low-rank adaptation (LoRA), and Stella-En-400M-V5 models. K-means clustering was applied to these embeddings. Classical comparisons included K-Medoids clustering on UMAP and FAMD-reduced mixed data. Silhouette scores and statistical tests evaluated cluster quality and distinctiveness. Stella-En-400M-V5 achieved the highest Silhouette Score (0.86). LLAMA 3.1 8B with the clustering objective performed better with a higher number of clusters, identifying subgroups with distinct nutritional, clinical, and socioeconomic profiles. LLM-based methods outperformed classical techniques by capturing richer context and prioritizing key features. These results highlight the potential of LLMs for contextual phenotyping and informed decision-making in resource-limited settings.

摘要

患者亚群聚类对个性化诊疗和资源优化至关重要。传统聚类方法难以处理高维异构的医疗数据且缺乏上下文理解能力。本研究基于低收入国家(LIC)2,686例儿科脓毒症数据集(含28个数值变量和119个分类变量),对比评估了基于大语言模型(LLM)的聚类方法与经典方法。患者记录被序列化为包含/不包含聚类目标的文本,分别采用量化版LLAMA 3.1 8B、低秩适配(LoRA)的DeepSeek-R1-Distill-Llama-8B及Stella-En-400M-V5模型生成嵌入向量,并通过K-means进行聚类。经典方法包括UMAP降维和混合数据FAMD降维后的K-Medoids聚类。轮廓系数和统计检验评估了聚类质量与区分度。结果显示:Stella-En-400M-V5获得最高轮廓系数(0.86);带聚类目标的LLAMA 3.1 8B在较多簇数时表现更优,能识别具有显著营养状况、临床特征和社会经济差异的亚群。基于LLM的方法通过捕捉丰富上下文和关键特征优先级,全面优于传统技术。这些发现凸显了LLM在资源受限环境中实现情境化表型分析和循证决策的潜力。
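
Two steps of the pipeline above lend themselves to a short sketch: serializing a record into text (with or without a clustering objective) and scoring a clustering with the silhouette coefficient. The field names, the objective wording, and the toy 2-D points standing in for LLM embeddings are all hypothetical; the study's actual prompts and embedding models are not reproduced here.

```python
import math

def serialize(record, objective=None):
    """Turn a patient record into one line of text, optionally prefixed by an objective."""
    body = "; ".join(f"{key}: {value}" for key, value in record.items())
    return f"{objective} {body}" if objective else body

def silhouette(points, labels):
    """Mean silhouette coefficient: (b - a) / max(a, b) averaged over all points, where
    a is the mean distance to the point's own cluster and b the mean distance to the
    nearest other cluster."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        by_cluster = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if j != i:
                by_cluster.setdefault(lab, []).append(math.dist(p, q))
        if own not in by_cluster or len(by_cluster) < 2:
            scores.append(0.0)  # singleton cluster or only one cluster present
            continue
        a = sum(by_cluster[own]) / len(by_cluster[own])
        b = min(sum(d) / len(d) for lab, d in by_cluster.items() if lab != own)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

text = serialize({"age_months": 14, "muac_mm": 118})
embeddings = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]  # stand-ins for LLM embeddings
good = silhouette(embeddings, [0, 0, 1, 1])
bad = silhouette(embeddings, [0, 1, 0, 1])
```

A score near 1 means tight, well-separated clusters; the deliberately wrong labeling in `bad` produces a negative score.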


Do Large Language Models Know Conflict? Investigating Parametric vs. Non-Parametric Knowledge of LLMs for Conflict Forecasting

Abstract

arXiv:2505.09852v1 Announce Type: cross Abstract: Large Language Models (LLMs) have shown impressive performance across natural language tasks, but their ability to forecast violent conflict remains underexplored. We investigate whether LLMs possess meaningful parametric knowledge, encoded in their pretrained weights, to predict conflict escalation and fatalities without external data. This is critical for early warning systems, humanitarian planning, and policy-making. We compare this parametric knowledge with non-parametric capabilities, where LLMs access structured and unstructured context from conflict datasets (e.g., ACLED, GDELT) and recent news reports via Retrieval-Augmented Generation (RAG). Incorporating external information could enhance model performance by providing up-to-date context otherwise missing from pretrained weights. Our two-part evaluation framework spans 2020-2024 across conflict-prone regions in the Horn of Africa and the Middle East. In the parametric setting, LLMs predict conflict trends and fatalities relying only on pretrained knowledge. In the non-parametric setting, models receive summaries of recent conflict events, indicators, and geopolitical developments. We compare predicted conflict trend labels (e.g., Escalate, Stable Conflict, De-escalate, Peace) and fatalities against historical data. Our findings highlight the strengths and limitations of LLMs for conflict forecasting and the benefits of augmenting them with structured external knowledge.

摘要

大型语言模型(LLM)在自然语言任务中展现出卓越性能,但其预测暴力冲突的能力尚未得到充分探索。本研究旨在验证LLM是否具备有意义的参数化知识——即编码于预训练权重中的知识——能否在不依赖外部数据的情况下预测冲突升级与伤亡情况。这对早期预警系统、人道主义规划及政策制定至关重要。我们对比了参数化与非参数化两种能力:前者仅利用预训练权重,后者则通过检索增强生成(RAG)技术获取冲突数据集(如ACLED、GDELT)和近期新闻报道的结构化与非结构化上下文。整合外部信息可补充预训练权重中缺失的最新背景,从而提升模型表现。我们构建的双阶段评估框架覆盖2020-2024年间非洲之角和中东等冲突高发地区。参数化实验中,LLM仅凭预训练知识预测冲突趋势与伤亡;非参数化实验中,模型接收近期冲突事件摘要、指标及地缘政治动态。通过将预测的冲突趋势标签(如"升级"、"稳定冲突"、"降级"、"和平")及伤亡数据与历史记录对比,本研究揭示了LLM在冲突预测中的优势与局限,并论证了结构化外部知识增强的重要价值。


Personalizing Large Language Models using Retrieval Augmented Generation and Knowledge Graph

Abstract

arXiv:2505.09945v1 Announce Type: cross Abstract: The advent of large language models (LLMs) has allowed numerous applications, including the generation of queried responses, to be leveraged in chatbots and other conversational assistants. Being trained on a plethora of data, LLMs often undergo high levels of over-fitting, resulting in the generation of extra and incorrect data, thus causing hallucinations in output generation. One of the root causes of such problems is the lack of timely, factual, and personalized information fed to the LLM. In this paper, we propose an approach to address these problems by introducing retrieval augmented generation (RAG) using knowledge graphs (KGs) to assist the LLM in personalized response generation tailored to the users. KGs have the advantage of storing continuously updated factual information in a structured way. While our KGs can be used for a variety of frequently updated personal data, such as calendar, contact, and location data, we focus on calendar data in this paper. Our experimental results show that our approach works significantly better in understanding personal information and generating accurate responses compared to the baseline LLMs using personal data as text inputs, with a moderate reduction in response time.

摘要

大型语言模型(LLMs)的出现使得诸多应用成为可能,包括在聊天机器人和其他对话助手中生成查询响应。由于训练数据量庞大,LLMs常出现高度过拟合现象,导致生成多余且错误的数据,从而引发输出中的幻觉问题。此类问题的根本原因之一在于缺乏及时、真实且个性化的信息输入。本文提出一种解决方案,通过引入基于知识图谱(KGs)的检索增强生成(RAG)技术,辅助LLM生成适应用户需求的个性化响应。知识图谱的优势在于能以结构化方式存储持续更新的真实信息。虽然我们的知识图谱可应用于多种频繁更新的个人数据(如日程、联系人和位置信息),但本文重点研究日程数据。实验结果表明,与将个人数据作为文本输入的基线LLMs相比,我们的方法在理解个人信息和生成准确响应方面表现显著更优,且响应时间也有适度缩短。


Reinforced Interactive Continual Learning via Real-time Noisy Human Feedback

Abstract

arXiv:2505.09925v1 Announce Type: cross Abstract: This paper introduces an interactive continual learning paradigm where AI models dynamically learn new skills from real-time human feedback while retaining prior knowledge. This paradigm distinctively addresses two major limitations of traditional continual learning: (1) dynamic model updates using streaming, real-time human-annotated data, rather than static datasets with fixed labels, and (2) the assumption of clean labels, by explicitly handling the noisy feedback common in real-world interactions. To tackle these problems, we propose RiCL, a Reinforced interactive Continual Learning framework leveraging Large Language Models (LLMs) to learn new skills effectively from dynamic feedback. RiCL incorporates three key components: a temporal consistency-aware purifier to automatically discern clean from noisy samples in data streams; an interaction-aware direct preference optimization strategy to align model behavior with human intent by reconciling AI-generated and human-provided feedback; and a noise-resistant contrastive learning module that captures robust representations by exploiting inherent data relationships, thus avoiding reliance on potentially unreliable labels. Extensive experiments on two benchmark datasets (FewRel and TACRED), contaminated with realistic noise patterns, demonstrate that our RiCL approach substantially outperforms existing combinations of state-of-the-art online continual learning and noisy-label learning methods.

摘要

本文提出了一种交互式持续学习范式,使得人工智能模型能够通过实时人类反馈动态学习新技能,同时保留已有知识。该范式独特地解决了传统持续学习的两个主要局限:(1) 采用流式实时人工标注数据进行动态模型更新,而非使用固定标签的静态数据集;(2) 通过显式处理现实交互中常见的噪声反馈,突破了传统方法对干净标签的假设。针对这些问题,我们提出了RiCL框架——一种基于大语言模型(LLMs)的强化交互式持续学习方法,可有效从动态反馈中学习新技能。RiCL包含三个核心组件:时序一致性感知净化器,用于自动识别数据流中的干净样本与噪声样本;交互感知直接偏好优化策略,通过协调AI生成反馈与人工反馈来实现模型行为与人类意图的对齐;以及抗噪声对比学习模块,通过挖掘数据内在关系来获取鲁棒表征,从而避免对潜在不可靠标签的依赖。在两个包含真实噪声模式的基准数据集(FewRel和TACRED)上的大量实验表明,我们的RiCL方法显著优于现有最先进的在线持续学习与噪声标签学习方法的组合方案。


CartoAgent: a multimodal large language model-powered multi-agent cartographic framework for map style transfer and evaluation

Abstract

arXiv:2505.09936v1 Announce Type: cross Abstract: The rapid development of generative artificial intelligence (GenAI) presents new opportunities to advance the cartographic process. Previous studies have either overlooked the artistic aspects of maps or faced challenges in creating both accurate and informative maps. In this study, we propose CartoAgent, a novel multi-agent cartographic framework powered by multimodal large language models (MLLMs). This framework simulates three key stages in cartographic practice: preparation, map design, and evaluation. At each stage, different MLLMs act as agents with distinct roles to collaborate, discuss, and utilize tools for specific purposes. In particular, CartoAgent leverages MLLMs' visual aesthetic capability and world knowledge to generate maps that are both visually appealing and informative. By separating style from geographic data, it can focus on designing stylesheets without modifying the vector-based data, thereby ensuring geographic accuracy. We applied CartoAgent to a specific task centered on map restyling, namely map style transfer and evaluation. The effectiveness of this framework was validated through extensive experiments and a human evaluation study. CartoAgent can be extended to support a variety of cartographic design decisions and inform future integrations of GenAI in cartography.

摘要

生成式人工智能(GenAI)的快速发展为推进制图流程提供了新机遇。既往研究或忽视地图的艺术性,或难以兼顾地图的精确性与信息丰富性。本研究提出CartoAgent——一个基于多模态大语言模型(MLLMs)的新型多智能体制图框架。该框架模拟制图实践的三个关键阶段:准备阶段、地图设计阶段和评估阶段。每个阶段由不同MLLMs担任特定角色代理,通过协作、讨论和工具调用实现目标。CartoAgent尤其注重利用MLLMs的视觉审美能力和世界知识,生成兼具视觉吸引力与信息价值的地图。通过将样式与地理数据分离,该框架可在不修改矢量数据的前提下专注于样式表设计,从而确保地理精度。我们将CartoAgent应用于以地图重样式化(即地图风格迁移与评估)为核心的任务,通过大量实验和人工评估验证了其有效性。该框架可扩展至多种制图设计决策,并为生成式人工智能在制图领域的未来集成提供参考。


Comparing Exploration-Exploitation Strategies of LLMs and Humans: Insights from Standard Multi-armed Bandit Tasks

Abstract

arXiv:2505.09901v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly used to simulate or automate human behavior in complex sequential decision-making tasks. A natural question is then whether LLMs exhibit similar decision-making behavior to humans, and can achieve comparable (or superior) performance. In this work, we focus on the exploration-exploitation (E&E) tradeoff, a fundamental aspect of dynamic decision-making under uncertainty. We employ canonical multi-armed bandit (MAB) tasks introduced in the cognitive science and psychiatry literature to conduct a comparative study of the E&E strategies of LLMs, humans, and MAB algorithms. We use interpretable choice models to capture the E&E strategies of the agents and investigate how explicit reasoning, through both prompting strategies and reasoning-enhanced models, shapes LLM decision-making. We find that reasoning shifts LLMs toward more human-like behavior, characterized by a mix of random and directed exploration. In simple stationary tasks, reasoning-enabled LLMs exhibit similar levels of random and directed exploration compared to humans. However, in more complex, non-stationary environments, LLMs struggle to match human adaptability, particularly in effective directed exploration, despite achieving similar regret in certain scenarios. Our findings highlight both the promise and limits of LLMs as simulators of human behavior and tools for automated decision-making and point to potential areas for improvement.

摘要

大型语言模型(LLMs)正日益被用于模拟或自动化人类在复杂序列决策任务中的行为。一个自然的问题是,LLMs是否表现出与人类相似的决策行为,并能达到相当(或更优)的性能。本研究聚焦于探索-利用(E&E)权衡这一不确定性下动态决策的基本问题。我们采用认知科学与精神病学文献中提出的经典多臂老虎机(MAB)任务,对LLMs、人类和MAB算法的E&E策略进行比较研究。通过可解释的选择模型捕捉智能体的E&E策略,并探究显式推理(通过提示策略和推理增强模型)如何影响LLM的决策。研究发现,推理使LLMs更趋近于人类行为特征,表现为随机探索与定向探索的混合。在简单静态任务中,具备推理能力的LLMs表现出与人类相近的随机和定向探索水平;而在更复杂的非静态环境中,尽管在某些场景下实现了相似的遗憾值,LLMs仍难以匹配人类的适应能力,尤其在有效定向探索方面存在不足。我们的发现既揭示了LLMs作为人类行为模拟器和自动化决策工具的潜力,也指出了其局限性,并为可能的改进方向提供了参考。
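
The distinction between random and directed exploration can be made concrete with a small choice-model sketch. This is a generic illustration, not the interpretable choice model used in the study: the uncertainty bonus of the form sqrt(log(t + 1) / (n + 1)), the softmax rule, and the parameter names `bonus` and `temperature` are assumptions chosen for brevity.

```python
import math

def choice_probs(means, counts, t, bonus=1.0, temperature=1.0):
    """Softmax choice probabilities over arms.

    `bonus` scales an uncertainty term that favors rarely sampled arms
    (directed exploration); `temperature` controls how noisy the choice is
    (random exploration).
    """
    utilities = [m + bonus * math.sqrt(math.log(t + 1) / (n + 1))
                 for m, n in zip(means, counts)]
    scaled = [u / temperature for u in utilities]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Two arms with equal estimated reward; arm 1 has never been pulled.
directed = choice_probs([0.5, 0.5], counts=[10, 0], t=10, bonus=1.0)
no_bonus = choice_probs([0.5, 0.5], counts=[10, 0], t=10, bonus=0.0)

# One clearly better arm; a lower temperature makes the choice less random.
noisy = choice_probs([1.0, 0.0], counts=[5, 5], t=10, bonus=0.0, temperature=1.0)
sharp = choice_probs([1.0, 0.0], counts=[5, 5], t=10, bonus=0.0, temperature=0.1)
```

Fitting `bonus` and `temperature` to an agent's observed choices would give one rough way to separate the two exploration styles.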


Analysing Safety Risks in LLMs Fine-Tuned with Pseudo-Malicious Cyber Security Data

Abstract

arXiv:2505.09974v1 Announce Type: cross Abstract: The integration of large language models (LLMs) into cyber security applications presents significant opportunities, such as enhancing threat analysis and malware detection, but can also introduce critical risks and safety concerns, including personal data leakage and automated generation of new malware. We present a systematic evaluation of safety risks in fine-tuned LLMs for cyber security applications. Using the OWASP Top 10 for LLM Applications framework, we assessed seven open-source LLMs: Phi 3 Mini 3.8B, Mistral 7B, Qwen 2.5 7B, Llama 3 8B, Llama 3.1 8B, Gemma 2 9B, and Llama 2 70B. Our evaluation shows that fine-tuning reduces safety resilience across all tested LLMs (e.g., the safety score of Llama 3.1 8B against prompt injection drops from 0.95 to 0.15). We propose and evaluate a safety alignment approach that carefully rewords instruction-response pairs to include explicit safety precautions and ethical considerations. This approach demonstrates that it is possible to maintain or even improve model safety while preserving technical utility, offering a practical path forward for developing safer fine-tuning methodologies. This work offers a systematic evaluation for safety risks in LLMs, enabling safer adoption of generative AI in sensitive domains, and contributing towards the development of secure, trustworthy, and ethically aligned LLMs.

摘要

将大型语言模型(LLMs)整合至网络安全应用虽能带来显著机遇(如提升威胁分析与恶意软件检测能力),但同时也可能引发关键风险与安全隐患,包括个人数据泄露和自动化生成新型恶意软件。本研究对网络安全领域微调LLMs的安全风险进行了系统性评估。基于OWASP LLM应用十大风险框架,我们测试了七款开源LLMs:Phi 3 Mini 3.8B、Mistral 7B、Qwen 2.5 7B、Llama 3 8B、Llama 3.1 8B、Gemma 2 9B及Llama 2 70B。评估表明微调会普遍降低模型的安全韧性(例如Llama 3.1 8B在提示注入攻击下的安全评分从0.95降至0.15)。我们提出并验证了一种安全对齐方法,通过审慎重构指令-响应对以纳入明确的安全预防措施与伦理考量。该方法证实了在保持技术实用性的同时维持乃至提升模型安全性的可行性,为开发更安全的微调方法提供了实践路径。本研究为LLMs安全风险提供了系统化评估框架,有助于在敏感领域更安全地采用生成式AI,并推动开发安全、可信且符合伦理的LLMs。


Dark LLMs: The Growing Threat of Unaligned AI Models

Abstract

arXiv:2505.10066v1 Announce Type: cross Abstract: Large Language Models (LLMs) rapidly reshape modern life, advancing fields from healthcare to education and beyond. However, alongside their remarkable capabilities lies a significant threat: the susceptibility of these models to jailbreaking. The fundamental vulnerability of LLMs to jailbreak attacks stems from the very data they learn from. As long as this training data includes unfiltered, problematic, or 'dark' content, the models can inherently learn undesirable patterns or weaknesses that allow users to circumvent their intended safety controls. Our research identifies the growing threat posed by dark LLMs: models deliberately designed without ethical guardrails or modified through jailbreak techniques. In our research, we uncovered a universal jailbreak attack that effectively compromises multiple state-of-the-art models, enabling them to answer almost any question and produce harmful outputs upon request. The main idea of our attack was published online over seven months ago. However, many of the tested LLMs were still vulnerable to this attack. Despite our responsible disclosure efforts, responses from major LLM providers were often inadequate, highlighting a concerning gap in industry practices regarding AI safety. As model training becomes more accessible and cheaper, and as open-source LLMs proliferate, the risk of widespread misuse escalates. Without decisive intervention, LLMs may continue democratizing access to dangerous knowledge, posing greater risks than anticipated.

摘要

大型语言模型(LLMs)正迅速重塑现代生活,推动从医疗保健到教育等诸多领域的发展。然而,在其卓越能力背后潜藏着重大威胁:这些模型对越狱攻击的脆弱性。LLMs易受越狱攻击的根本原因在于其学习的数据本身。只要训练数据包含未经过滤、有问题的或"黑暗"内容,模型就可能习得不良模式或弱点,使用户能够绕过其设计的安全控制机制。我们的研究发现了日益增长的"黑暗LLMs"威胁——这些模型被刻意设计为缺乏伦理约束,或通过越狱技术进行修改。在研究中,我们发现了一种通用越狱攻击方法,能够有效攻破多个最先进模型,使其能够回答几乎所有问题并根据请求生成有害输出。该攻击的核心思路早在七个多月前就已公开发布,但许多受测LLM仍存在此漏洞。尽管我们进行了负责任的披露,主要LLM提供商的应对措施往往不足,这凸显出行业在AI安全实践方面的严重缺陷。随着模型训练门槛降低、成本下降,以及开源LLMs的激增,大规模滥用的风险正在加剧。若不采取果断干预措施,LLMs可能持续推动危险知识的平民化,带来远超预期的风险。


The Evolving Landscape of Generative Large Language Models and Traditional Natural Language Processing in Medicine

Abstract

arXiv:2505.10261v1 Announce Type: cross Abstract: Natural language processing (NLP) has been traditionally applied to medicine, and generative large language models (LLMs) have become prominent recently. However, the differences between them across different medical tasks remain underexplored. We analyzed 19,123 studies, finding that generative LLMs demonstrate advantages in open-ended tasks, while traditional NLP dominates in information extraction and analysis tasks. As these technologies advance, ethical use of them is essential to ensure their potential in medical applications.

摘要

自然语言处理(NLP)在医学领域历来有广泛应用,而生成式大语言模型(LLMs)近年来逐渐崭露头角。然而,两者在不同医疗任务中的差异仍缺乏深入探讨。通过分析19,123项研究,我们发现生成式LLMs在开放式任务中展现出优势,而传统NLP则在信息提取与分析任务中占据主导地位。随着这些技术的进步,如何合乎伦理地运用它们对实现其在医疗应用中的潜力至关重要。


Private Transformer Inference in MLaaS: A Survey

Abstract

arXiv:2505.10315v1 Announce Type: cross Abstract: Transformer models have revolutionized AI, powering applications like content generation and sentiment analysis. However, their deployment in Machine Learning as a Service (MLaaS) raises significant privacy concerns, primarily due to the centralized processing of sensitive user data. Private Transformer Inference (PTI) offers a solution by utilizing cryptographic techniques such as secure multi-party computation and homomorphic encryption, enabling inference while preserving both user data and model privacy. This paper reviews recent PTI advancements, highlighting state-of-the-art solutions and challenges. We also introduce a structured taxonomy and evaluation framework for PTI, focusing on balancing resource efficiency with privacy and bridging the gap between high-performance inference and data privacy.

摘要

Transformer模型彻底改变了人工智能领域,为内容生成和情感分析等应用提供了强大支持。然而,其在机器学习即服务(MLaaS)中的部署引发了重大隐私问题,主要源于敏感用户数据的集中处理。私有Transformer推理(PTI)通过采用安全多方计算和同态加密等密码学技术,在保护用户数据和模型隐私的同时实现推理功能,为此提供了解决方案。本文综述了PTI领域的最新进展,重点介绍了前沿解决方案与现存挑战。我们还提出了一套结构化的PTI分类体系与评估框架,旨在资源效率与隐私保护之间实现平衡,并弥合高性能推理与数据隐私之间的鸿沟。


The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think

Abstract

arXiv:2505.10185v1 Announce Type: cross Abstract: Long chain-of-thought (CoT) is an essential ingredient in effective usage of modern large language models, but our understanding of the reasoning strategies underlying these capabilities remains limited. While some prior works have attempted to categorize CoTs using predefined strategy types, such approaches are constrained by human intuition and fail to capture the full diversity of model behaviors. In this work, we introduce the CoT Encyclopedia, a bottom-up framework for analyzing and steering model reasoning. Our method automatically extracts diverse reasoning criteria from model-generated CoTs, embeds them into a semantic space, clusters them into representative categories, and derives contrastive rubrics to interpret reasoning behavior. Human evaluations show that this framework produces more interpretable and comprehensive analyses than existing methods. Moreover, we demonstrate that this understanding enables performance gains: we can predict which strategy a model is likely to use and guide it toward more effective alternatives. Finally, we provide practical insights, such as that training data format (e.g., free-form vs. multiple-choice) has a far greater impact on reasoning behavior than data domain, underscoring the importance of format-aware model design.

摘要

长思维链(CoT)是现代大语言模型有效运用的关键要素,但我们对这些能力背后的推理策略理解仍然有限。尽管先前的一些研究尝试使用预定义的策略类型对CoT进行分类,但这类方法受限于人类直觉,无法全面捕捉模型行为的多样性。本研究提出“CoT百科全书”,一种自下而上的框架用于分析和引导模型推理。我们的方法自动从模型生成的CoT中提取多样化的推理标准,将其嵌入语义空间,聚类为代表性类别,并通过对比性评估标准解释推理行为。人工评估表明,该框架比现有方法产生更具可解释性和全面性的分析。此外,我们证明这种理解能够提升性能:可以预测模型可能使用的策略,并引导其采用更有效的替代方案。最后,我们提供实践洞见,例如训练数据格式(如自由形式与多项选择)对推理行为的影响远大于数据领域,这强调了格式感知模型设计的重要性。


Comparing LLM Text Annotation Skills: A Study on Human Rights Violations in Social Media Data

Abstract

arXiv:2505.10260v1 Announce Type: cross Abstract: In the era of increasingly sophisticated natural language processing (NLP) systems, large language models (LLMs) have demonstrated remarkable potential for diverse applications, including tasks requiring nuanced textual understanding and contextual reasoning. This study investigates the capabilities of multiple state-of-the-art LLMs - GPT-3.5, GPT-4, LLAMA3, Mistral 7B, and Claude-2 - for zero-shot and few-shot annotation of a complex textual dataset comprising social media posts in Russian and Ukrainian. Specifically, the focus is on the binary classification task of identifying references to human rights violations within the dataset. To evaluate the effectiveness of these models, their annotations are compared against a gold standard set of human double-annotated labels across 1000 samples. The analysis includes assessing annotation performance under different prompting conditions, with prompts provided in both English and Russian. Additionally, the study explores the unique patterns of errors and disagreements exhibited by each model, offering insights into their strengths, limitations, and cross-linguistic adaptability. By juxtaposing LLM outputs with human annotations, this research contributes to understanding the reliability and applicability of LLMs for sensitive, domain-specific tasks in multilingual contexts. It also sheds light on how language models handle inherently subjective and context-dependent judgments, a critical consideration for their deployment in real-world scenarios.

摘要

在自然语言处理(NLP)系统日益复杂的时代,大型语言模型(LLMs)已展现出在多样化应用中的显著潜力,包括需要细致文本理解和上下文推理的任务。本研究调查了多种前沿LLMs(GPT-3.5、GPT-4、LLAMA3、Mistral 7B和Claude-2)在零样本和少样本标注复杂文本数据集(包含俄语和乌克兰语的社交媒体帖子)方面的能力,特别关注识别数据集中涉及侵犯人权内容的二元分类任务。为评估这些模型的有效性,将其标注结果与1000个样本的人工双重标注黄金标准集进行对比。分析包括评估不同提示条件下(使用英语和俄语提示)的标注性能,并探究各模型表现出的独特错误模式和分歧,从而揭示其优势、局限性和跨语言适应性。通过对比LLMs输出与人工标注,本研究有助于理解LLMs在多语言环境下处理敏感领域特定任务的可靠性和适用性,同时揭示了语言模型如何处理本质上具有主观性和语境依赖性的判断——这对其实际场景部署至关重要。


AutoPentest: Enhancing Vulnerability Management With Autonomous LLM Agents

Abstract

arXiv:2505.10321v1 Announce Type: cross Abstract: A recent area of increasing research is the use of Large Language Models (LLMs) in penetration testing, which promises to reduce costs and thus allow for higher frequency. We conduct a review of related work, identifying best practices and common evaluation issues. We then present AutoPentest, an application for performing black-box penetration tests with a high degree of autonomy. AutoPentest is based on the LLM GPT-4o from OpenAI and the LLM agent framework LangChain. It can perform complex multi-step tasks, augmented by external tools and knowledge bases. We conduct a study on three capture-the-flag style Hack The Box (HTB) machines, comparing our implementation AutoPentest with the baseline approach of manually using the ChatGPT-4o user interface. Both approaches are able to complete 15-25% of the subtasks on the HTB machines, with AutoPentest slightly outperforming ChatGPT. We measure a total cost of US$96.20 when using AutoPentest across all experiments, while a one-month subscription to ChatGPT Plus costs US$20. The results show that further implementation efforts and the use of more powerful LLMs released in the future are likely to make this a viable part of vulnerability management.

摘要

近年来,大型语言模型(LLM)在渗透测试中的应用研究日益增多,该方法有望降低成本从而实现更高测试频率。我们对相关研究进行了系统性综述,归纳出最佳实践方案并指出常见评估问题。随后提出AutoPentest,即一个具有高度自主性的黑盒渗透测试应用程序。该系统基于OpenAI的GPT-4o语言模型和LangChain智能体框架构建,通过外部工具与知识库增强,可执行复杂的多步骤测试任务。我们在三台夺旗式Hack The Box(HTB)测试机上开展实验,将AutoPentest与人工操作ChatGPT-4o用户界面的基线方法进行对比。两种方法均能完成15-25%的HTB子任务,其中AutoPentest表现略优。实验测得AutoPentest在全部实验中的总成本为96.20美元,而ChatGPT Plus月订阅费为20美元。结果表明,随着未来更强大LLM的发布及系统优化,该方法有望成为漏洞管理体系中可行的组成部分。


J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning

Abstract

arXiv:2505.10320v1 Announce Type: cross Abstract: The progress of AI is bottlenecked by the quality of evaluation, and powerful LLM-as-a-Judge models have proved to be a core solution. Improved judgment ability is enabled by stronger chain-of-thought reasoning, motivating the need to find the best recipes for training such models to think. In this work we introduce J1, a reinforcement learning approach to training such models. Our method converts both verifiable and non-verifiable prompts to judgment tasks with verifiable rewards that incentivize thinking and mitigate judgment bias. In particular, our approach outperforms all other existing 8B or 70B models when trained at those sizes, including models distilled from DeepSeek-R1. J1 also outperforms o1-mini, and even R1 on some benchmarks, despite training a smaller model. We provide analysis and ablations comparing Pairwise-J1 vs Pointwise-J1 models, offline vs online training recipes, reward strategies, seed prompts, and variations in thought length and content. We find that our models make better judgments by learning to outline evaluation criteria, comparing against self-generated reference answers, and re-evaluating the correctness of model responses.

摘要

人工智能的发展正面临评估质量瓶颈,而强大的大语言模型即评判员(LLM-as-a-Judge)已被证明是核心解决方案。通过增强的思维链推理能力可提升判断性能,这促使我们需要寻找训练此类思维模型的最佳方法。本研究提出J1——一种基于强化学习的模型训练方法。该方法将可验证与不可验证的提示均转化为具有可验证奖励的判断任务,从而激励深度思考并减少判断偏差。值得注意的是,当模型规模分别为80亿和700亿参数时,我们的方法在性能上超越了所有同规模现有模型(包括基于DeepSeek-R1提炼的模型)。J1不仅优于o1-mini模型,在某些基准测试中甚至超越了R1模型,尽管其训练规模更小。我们通过对比分析Pairwise-J1与Pointwise-J1模型、离线与在线训练方案、奖励策略、种子提示以及思维长度与内容的变化,提供了系统性研究。研究发现,我们的模型通过学会概述评估标准、与自生成参考答案进行对比,以及重新评估模型响应的正确性,从而做出更优判断。
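
To make the idea of a verifiable reward for a judgment task concrete, here is a minimal sketch of a pairwise reward. It is an illustration under stated assumptions, not the J1 training recipe: the `Verdict: A` output format is hypothetical, and requiring a correct verdict in both presentation orders is one plausible way to discourage position bias rather than a confirmed detail of the method.

```python
import re

def parse_verdict(text):
    """Extract the final 'Verdict: A' or 'Verdict: B' from a judge's output, if any."""
    matches = re.findall(r"Verdict:\s*([AB])\b", text)
    return matches[-1] if matches else None

def pairwise_reward(output_original, output_swapped, better="A"):
    """Return 1.0 only if the judge prefers the known-better response in both orders.

    `output_original` judges the pair as given (the better response has label `better`);
    `output_swapped` judges the same pair with positions exchanged, so the correct
    label flips.
    """
    flipped = "B" if better == "A" else "A"
    correct_original = parse_verdict(output_original) == better
    correct_swapped = parse_verdict(output_swapped) == flipped
    return 1.0 if correct_original and correct_swapped else 0.0

consistent = pairwise_reward("Criteria first, then compare. Verdict: A",
                             "Same pair, swapped order. Verdict: B")
position_biased = pairwise_reward("Verdict: A", "Verdict: A")
unparseable = pairwise_reward("I cannot decide.", "Verdict: B")
```

Because the preferred response is known in advance, the reward can be checked automatically even when the underlying prompt has no single verifiable answer.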


FactsR: A Safer Method for Producing High Quality Healthcare Documentation

Abstract

arXiv:2505.10360v1 Announce Type: cross Abstract: There are now a multitude of AI-scribing solutions for healthcare promising the utilization of large language models for ambient documentation. However, these AI scribes still rely on one-shot or few-shot prompts for generating notes after the consultation has ended, employing little to no reasoning. This risks long notes with an increase in hallucinations, misrepresentation of the intent of the clinician, and reliance on the proofreading of the clinician to catch errors. This is a dangerous combination for patient safety if vigilance is compromised by workload and fatigue. In this paper, we introduce a method for extracting salient clinical information in real-time alongside the healthcare consultation, denoted Facts, and use that information recursively to generate the final note. The FactsR method results in more accurate and concise notes by placing the clinician in the loop of note generation, while opening up new use cases within real-time decision support.

摘要

目前医疗领域涌现出众多AI记录解决方案,承诺利用大语言模型实现环境感知式文档生成。然而这些AI记录器仍依赖于单次或少量提示词在问诊结束后生成病历,几乎不采用任何推理机制。这种做法可能导致病历冗长、幻觉内容增加、曲解临床医生意图,并依赖医生校对纠错。若工作负荷与疲劳削弱了警觉性,这种危险组合将危及患者安全。本文提出一种在医疗问诊过程中实时提取关键临床信息(称为事实)的方法,并递归利用该信息生成最终病历。通过将临床医生纳入病历生成的闭环,FactsR方法能生成更准确简洁的病历,同时为实时决策支持开辟新的应用场景。


Do LLMs Memorize Recommendation Datasets? A Preliminary Study on MovieLens-1M

Abstract

arXiv:2505.10212v1 Announce Type: cross Abstract: Large Language Models (LLMs) have become increasingly central to recommendation scenarios due to their remarkable natural language understanding and generation capabilities. Although significant research has explored the use of LLMs for various recommendation tasks, little effort has been dedicated to verifying whether they have memorized public recommendation datasets as part of their training data. This is undesirable because memorization reduces the generalizability of research findings, as benchmarking on memorized datasets does not guarantee generalization to unseen datasets. Furthermore, memorization can amplify biases, for example, some popular items may be recommended more frequently than others. In this work, we investigate whether LLMs have memorized public recommendation datasets. Specifically, we examine two model families (GPT and Llama) across multiple sizes, focusing on one of the most widely used datasets in recommender systems: MovieLens-1M. First, we define dataset memorization as the extent to which item attributes, user profiles, and user-item interactions can be retrieved by prompting the LLMs. Second, we analyze the impact of memorization on recommendation performance. Lastly, we examine whether memorization varies across model families and model sizes. Our results reveal that all models exhibit some degree of memorization of MovieLens-1M, and that recommendation performance is related to the extent of memorization. We have made all the code publicly available at: https://github.com/sisinflab/LLM-MemoryInspector

摘要

大语言模型(LLMs)因其卓越的自然语言理解与生成能力,在推荐场景中日益占据核心地位。尽管已有大量研究探索了LLMs在各种推荐任务中的应用,但鲜有工作致力于验证这些模型是否已将公开推荐数据集作为训练数据记忆。这一现象值得关注,因为记忆行为会降低研究结论的普适性——基于被记忆数据集的基准测试无法保证模型在未见数据集上的泛化能力。此外,记忆还可能放大偏见,例如导致某些热门物品被更频繁地推荐。

本研究针对LLMs是否记忆了公开推荐数据集展开探究。我们以推荐系统领域最广泛使用的MovieLens-1M数据集为研究对象,考察了GPT和Llama两个模型家族的不同规模版本。首先,我们将数据集记忆定义为通过提示LLMs可检索物品属性、用户画像及用户-物品交互的程度;其次,分析了记忆对推荐性能的影响;最后,检验了记忆行为在不同模型家族和规模间的差异。实验结果表明:所有模型均表现出对MovieLens-1M不同程度的记忆,且推荐性能与记忆程度存在关联。相关代码已公开于:https://github.com/sisinflab/LLM-MemoryInspector


Rethinking Repetition Problems of LLMs in Code Generation

Abstract

arXiv:2505.10402v1 Announce Type: cross Abstract: With the advent of neural language models, the performance of code generation has been significantly boosted. However, the problem of repetitions during the generation process continues to linger. Previous work has primarily focused on content repetition, which is merely a fraction of the broader repetition problem in code generation. A more prevalent and challenging problem is structural repetition. In structural repetition, the repeated code appears in various patterns but possesses a fixed structure, which can be inherently reflected in grammar. In this paper, we formally define structural repetition and propose an efficient decoding approach called RPG, which stands for Repetition Penalization based on Grammar, to alleviate the repetition problems in code generation for LLMs. Specifically, RPG first leverages grammar rules to identify repetition problems during code generation, and then strategically decays the likelihood of critical tokens that contribute to repetitions, thereby mitigating them in code generation. To facilitate this study, we construct a new dataset, CodeRepetEval, to comprehensively evaluate approaches for mitigating the repetition problems in code generation. Extensive experimental results demonstrate that RPG substantially outperforms the best-performing baselines on the CodeRepetEval dataset as well as the HumanEval and MBPP benchmarks, effectively reducing repetitions and enhancing the quality of generated code.

摘要

随着神经语言模型的兴起,代码生成性能得到了显著提升。然而,生成过程中的重复问题依然存在。先前研究主要关注内容重复,但这仅是代码生成中广泛重复问题的一个方面。更具普遍性和挑战性的是结构重复问题——重复代码以不同模式出现却保持固定结构,这种特性本质上可通过语法反映。本文正式定义了结构重复,并提出一种名为RPG(基于语法的重复惩罚)的高效解码方法,以缓解大语言模型代码生成中的重复问题。具体而言,RPG首先利用语法规则识别代码生成过程中的重复现象,继而策略性地衰减导致重复的关键标记的生成概率,从而有效抑制重复。为推进本研究,我们构建了新数据集CodeRepetEval,用于全面评估代码生成重复问题的缓解方法。大量实验结果表明,RPG在CodeRepetEval数据集及HumanEval和MBPP基准测试上显著优于现有最佳基线方法,能有效减少重复并提升生成代码质量。
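
Structural repetition (different content, same shape) and the idea of decaying a critical token's likelihood can be sketched as follows. This is a simplified stand-in, not the RPG algorithm: a regex-based line signature replaces real grammar rules, and the `decay` and `threshold` parameters, along with the choice of which token counts as critical, are assumptions for illustration.

```python
import math
import re

# Number | identifier | quoted string | any other single non-space character.
TOKEN = re.compile(r"\s*(?:(\d+(?:\.\d+)?)|([A-Za-z_]\w*)|(\"[^\"]*\"|'[^']*')|(\S))")
KEYWORDS = {"if", "elif", "else", "for", "while", "return", "def", "in"}

def signature(line):
    """Abstract a line into its structure: identifiers, numbers and strings are
    replaced by placeholders; keywords and punctuation are kept."""
    sig = []
    for num, name, string, other in TOKEN.findall(line):
        if num:
            sig.append("NUM")
        elif name:
            sig.append(name if name in KEYWORDS else "ID")
        elif string:
            sig.append("STR")
        else:
            sig.append(other)
    return tuple(sig)

def trailing_repeats(lines):
    """Count consecutive trailing lines that share the last line's signature."""
    if not lines:
        return 0
    last = signature(lines[-1])
    count = 0
    for line in reversed(lines):
        if signature(line) != last:
            break
        count += 1
    return count

def penalize(logits, critical_token, repeats, decay=0.5, threshold=2):
    """Scale down the critical token's probability mass by `decay` for each repeat
    at or beyond `threshold` (adding log(decay) to a logit multiplies its weight)."""
    adjusted = dict(logits)
    if repeats >= threshold and critical_token in adjusted:
        adjusted[critical_token] += (repeats - threshold + 1) * math.log(decay)
    return adjusted

generated = ['elif x == 1: return "a"', 'elif y == 2: return "b"', 'elif z == 3: return "c"']
repeats = trailing_repeats(generated)
adjusted = penalize({"elif": 2.0, "else": 1.0}, "elif", repeats)
```

The three `elif` lines differ in content but share one signature, so the token that would open a fourth such line is down-weighted while other tokens are left untouched.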


Are Sparse Autoencoders Useful for Java Function Bug Detection?

Abstract

arXiv:2505.10375v1 Announce Type: cross Abstract: Software vulnerabilities such as buffer overflows and SQL injections are a major source of security breaches. Traditional methods for vulnerability detection remain essential but are limited by high false positive rates, scalability issues, and reliance on manual effort. These constraints have driven interest in AI-based approaches to automated vulnerability detection and secure code generation. While Large Language Models (LLMs) have opened new avenues for classification tasks, their complexity and opacity pose challenges for interpretability and deployment. Sparse Autoencoders (SAEs) offer a promising solution to this problem. We explore whether SAEs can serve as a lightweight, interpretable alternative for bug detection in Java functions. We evaluate the effectiveness of SAEs when applied to representations from GPT-2 Small and Gemma 2B, examining their capacity to highlight buggy behaviour without fine-tuning the underlying LLMs. We found that SAE-derived features enable bug detection with an F1 score of up to 89%, consistently outperforming fine-tuned transformer encoder baselines. Our work provides the first empirical evidence that SAEs can be used to detect software bugs directly from the internal representations of pretrained LLMs, without any fine-tuning or task-specific supervision.

摘要

缓冲区溢出和SQL注入等软件漏洞是安全事件的主要来源。传统的漏洞检测方法虽然仍不可或缺,但存在误报率高、可扩展性不足以及依赖人工操作等局限性。这些限制促使人们关注基于人工智能的自动化漏洞检测与安全代码生成方法。尽管大语言模型(LLMs)为分类任务开辟了新途径,但其复杂性和不透明性给可解释性和部署带来了挑战。稀疏自编码器(SAEs)为该问题提供了可行的解决方案。本研究探讨了SAEs能否作为Java函数错误检测的轻量级、可解释替代方案。我们评估了SAEs应用于GPT-2 Small和Gemma 2B模型表征时的有效性,检验其在无需微调底层LLMs的情况下识别缺陷行为的能力。实验发现,基于SAE的特征可实现F1分数高达89%的错误检测,其性能始终优于经过微调的Transformer编码器基线。本研究首次提供实证证据表明:SAEs可直接利用预训练LLMs的内部表征检测软件错误,且无需任何微调或任务特定监督。


Are Large Language Models Robust in Understanding Code Against Semantics-Preserving Mutations?

Abstract

arXiv:2505.10443v1 Announce Type: cross Abstract: Understanding the reasoning and robustness of Large Language Models (LLMs) is critical for their reliable use in programming tasks. While recent studies have assessed LLMs' ability to predict program outputs, most focus solely on the accuracy of those predictions, without evaluating the reasoning behind them. Moreover, it has been observed on mathematical reasoning tasks that LLMs can arrive at correct answers through flawed logic, raising concerns about similar issues in code understanding. In this work, we evaluate whether state-of-the-art LLMs with up to 8B parameters can reason about Python programs or are simply guessing. We apply five semantics-preserving code mutations: renaming variables, mirroring comparison expressions, swapping if-else branches, converting for loops to while, and loop unrolling. These mutations maintain program semantics while altering its syntax. We evaluated six LLMs and performed a human expert analysis using LiveCodeBench to assess whether the correct predictions are based on sound reasoning. We also evaluated prediction stability across different code mutations on LiveCodeBench and CruxEval. Our findings show that some LLMs, such as Llama3.2, produce correct predictions based on flawed reasoning in up to 61% of cases. Furthermore, LLMs often change predictions in response to our code mutations, indicating limited robustness in their semantic understanding.

摘要

理解大型语言模型(LLMs)的推理能力与鲁棒性对其在编程任务中的可靠应用至关重要。尽管近期研究评估了LLMs预测程序输出的能力,但多数仅关注预测准确性,而忽视了对推理过程的评估。此外,数学推理任务中已观察到LLMs可能通过错误逻辑得出正确答案,这引发了对其在代码理解中存在类似问题的担忧。

本研究评估了参数规模达80亿的最先进LLMs是否能真正推理Python程序,抑或仅是猜测。我们应用了五种保持语义的代码变异:变量重命名、比较表达式镜像、if-else分支交换、for循环转while循环以及循环展开。这些变异在保持程序语义的同时改变了其语法结构。我们测试了六种LLMs,并借助LiveCodeBench进行专家分析以判断正确预测是否基于合理推理。同时,我们在LiveCodeBench和CruxEval上评估了不同代码变异下的预测稳定性。研究结果表明,某些LLMs(如Llama3.2)高达61%的正确预测基于错误推理。此外,LLMs经常因代码变异而改变预测,表明其语义理解的鲁棒性有限。
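
The mutations listed in the abstract can be illustrated with a small sketch. The two functions below are hypothetical examples, not taken from the paper's benchmarks; `mutated` applies variable renaming, comparison mirroring, if-else branch swapping, and for-to-while conversion to `original`, and both should return the same value for any input list.

```python
def original(xs):
    # Count the elements greater than 3 using a for loop.
    count = 0
    for x in xs:
        if x > 3:
            count += 1
        else:
            count += 0
    return count


def mutated(values):
    # Same semantics after four mutations:
    #   renamed variables, mirrored comparison (x > 3 becomes 3 < x),
    #   swapped if-else branches (with the condition negated), for -> while.
    total = 0
    i = 0
    while i < len(values):
        if not (3 < values[i]):
            total += 0
        else:
            total += 1
        i += 1
    return total


sample = [1, 5, 3, 7, 4]
# A robust model should predict the same output for both versions.
assert original(sample) == mutated(sample)
```

A model that reasons about semantics should be indifferent to which of the two versions it is shown; a prediction that changes between them is the kind of instability the abstract reports.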


Multi-Token Prediction Needs Registers

Abstract

arXiv:2505.10518v1 Announce Type: cross Abstract: Multi-token prediction has emerged as a promising objective for improving language model pretraining, but its benefits have not consistently generalized to other settings such as fine-tuning. In this paper, we propose MuToR, a simple and effective approach to multi-token prediction that interleaves learnable register tokens into the input sequence, each tasked with predicting future targets. Compared to existing methods, MuToR offers several key advantages: it introduces only a negligible number of additional parameters, requires no architectural changes--ensuring compatibility with off-the-shelf pretrained language models--and remains aligned with the next-token pretraining objective, making it especially well-suited for supervised fine-tuning. Moreover, it naturally supports scalable prediction horizons. We demonstrate the effectiveness and versatility of MuToR across a range of use cases, including supervised fine-tuning, parameter-efficient fine-tuning (PEFT), and pretraining, on challenging generative tasks in both language and vision domains. Our code will be available at: https://github.com/nasosger/MuToR.

摘要

多标记预测已成为改进语言模型预训练的一种有前景的目标,但其优势尚未持续推广至微调等其他场景。本文提出MuToR——一种简单有效的多标记预测方法,该方法将可学习的寄存器标记交错插入输入序列,每个标记负责预测未来目标。与现有方法相比,MuToR具有若干关键优势:仅引入可忽略不计的额外参数量,无需架构修改(确保与现成预训练语言模型的兼容性),且保持与下一标记预训练目标的一致性,使其特别适用于监督微调。此外,该方法天然支持可扩展的预测范围。我们在语言和视觉领域具有挑战性的生成任务上,通过监督微调、参数高效微调(PEFT)和预训练等多种使用场景,验证了MuToR的有效性和通用性。代码将在以下地址公开:https://github.com/nasosger/MuToR。


Real-Time Out-of-Distribution Failure Prevention via Multi-Modal Reasoning

Abstract

arXiv:2505.10547v1 Announce Type: cross Abstract: Foundation models can provide robust high-level reasoning on appropriate safety interventions in hazardous scenarios beyond a robot's training data, i.e. out-of-distribution (OOD) failures. However, due to the high inference latency of Large Vision and Language Models, current methods rely on manually defined intervention policies to enact fallbacks, thereby lacking the ability to plan generalizable, semantically safe motions. To overcome these challenges we present FORTRESS, a framework that generates and reasons about semantically safe fallback strategies in real time to prevent OOD failures. At a low frequency in nominal operations, FORTRESS uses multi-modal reasoners to identify goals and anticipate failure modes. When a runtime monitor triggers a fallback response, FORTRESS rapidly synthesizes plans to fallback goals while inferring and avoiding semantically unsafe regions in real time. By bridging open-world, multi-modal reasoning with dynamics-aware planning, we eliminate the need for hard-coded fallbacks and human safety interventions. FORTRESS outperforms on-the-fly prompting of slow reasoning models in safety classification accuracy on synthetic benchmarks and real-world ANYmal robot data, and further improves system safety and planning success in simulation and on quadrotor hardware for urban navigation.

摘要

基础模型能够在超出机器人训练数据的危险场景(即分布外故障)中,为安全干预措施提供鲁棒的高层推理。然而,由于大型视觉与语言模型的高推理延迟,现有方法依赖手动定义的干预策略实施回退方案,因而缺乏规划通用化、语义安全运动的能力。为克服这些挑战,我们提出FORTRESS框架,该框架能实时生成并推理语义安全的回退策略以防止分布外故障。在正常运行的低频阶段,FORTRESS利用多模态推理器识别目标并预判故障模式。当运行时监测器触发回退响应时,FORTRESS快速合成回退目标计划,同时实时推断并避开语义不安全区域。通过将开放世界的多模态推理与动态感知规划相融合,我们消除了对硬编码回退方案和人工安全干预的需求。FORTRESS在合成基准测试和ANYmal机器人真实数据的安全分类准确率上优于即时提示的慢速推理模型,并进一步提升了四旋翼无人机城市导航仿真与硬件系统的安全性和规划成功率。


Superposition Yields Robust Neural Scaling

Abstract

arXiv:2505.10465v1 Announce Type: cross Abstract: The success of today's large language models (LLMs) depends on the observation that larger models perform better. However, the origin of this neural scaling law -- the finding that loss decreases as a power law with model size -- remains unclear. Starting from two empirical principles -- that LLMs represent more things than the model dimensions (widths) they have (i.e., representations are superposed), and that words or concepts in language occur with varying frequencies -- we constructed a toy model to study the loss scaling with model size. We found that when superposition is weak, meaning only the most frequent features are represented without interference, the scaling of loss with model size depends on the underlying feature frequency; if feature frequencies follow a power law, so does the loss. In contrast, under strong superposition, where all features are represented but overlap with each other, the loss becomes inversely proportional to the model dimension across a wide range of feature frequency distributions. This robust scaling behavior is explained geometrically: when many more vectors are packed into a lower dimensional space, the interference (squared overlaps) between vectors scales inversely with that dimension. We then analyzed four families of open-sourced LLMs and found that they exhibit strong superposition and quantitatively match the predictions of our toy model. The Chinchilla scaling law turned out to also agree with our results. We conclude that representation superposition is an important mechanism underlying the observed neural scaling laws. We anticipate that these insights will inspire new training strategies and model architectures to achieve better performance with less computation and fewer parameters.

摘要

当今大型语言模型(LLMs)的成功依赖于一个观察:模型规模越大性能越优。然而,这种神经缩放定律——即损失随模型尺寸呈幂律下降的规律——的起源仍不明确。基于两个实证原则:LLMs所表征的内容远超其模型维度(宽度)所能容纳(即表征存在叠加),以及语言中词汇或概念的出现频率存在差异——我们构建了一个玩具模型来研究损失随模型尺寸的缩放规律。研究发现:当叠加效应较弱时(仅最高频特征能无干扰地表征),损失随模型尺寸的缩放取决于底层特征频率;若特征频率服从幂律分布,则损失亦呈幂律变化。相反,在强叠加状态下(所有特征均被表征但相互重叠),损失与模型维度成反比,且该规律广泛适用于各类特征频率分布。这种稳健的缩放行为可通过几何原理解释:当更多向量被压缩至低维空间时,向量间干扰(平方重叠量)与维度成反比缩放。随后我们对四个开源LLM家族进行分析,发现它们均表现出强叠加特性,且定量符合玩具模型的预测。Chinchilla缩放定律也被证实与我们的结论一致。研究表明:表征叠加是观测到的神经缩放定律背后的重要机制。我们预期这些发现将启发新的训练策略和模型架构,从而以更少计算量和参数实现更优性能。
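
The geometric argument (squared overlaps between many vectors packed into a low-dimensional space scale inversely with the dimension) can be checked numerically. The sketch below uses random unit vectors as a stand-in for learned feature directions; this is an assumption made for illustration and is not the paper's toy model.

```python
import math
import random


def random_unit_vector(dim, rng):
    # Sample an isotropic Gaussian vector and normalise it to unit length.
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_squared_overlap(dim, n_vectors, rng):
    # Average squared dot product over all distinct pairs of vectors.
    vecs = [random_unit_vector(dim, rng) for _ in range(n_vectors)]
    total, pairs = 0.0, 0
    for i in range(n_vectors):
        for j in range(i + 1, n_vectors):
            dot = sum(a * b for a, b in zip(vecs[i], vecs[j]))
            total += dot * dot
            pairs += 1
    return total / pairs


rng = random.Random(0)
for dim in (8, 16, 32, 64):
    overlap = mean_squared_overlap(dim, n_vectors=200, rng=rng)
    # If overlap scales as 1/dim, the product overlap * dim stays close to 1.
    print(dim, round(overlap * dim, 2))
```

Here 200 vectors are packed into at most 64 dimensions, so they cannot all be orthogonal; the printed product staying near 1 is the inverse-dimension scaling described in the abstract.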


AriGraph: Learning Knowledge Graph World Models with Episodic Memory for LLM Agents

Abstract

arXiv:2407.04363v3 Announce Type: replace Abstract: Advancements in the capabilities of Large Language Models (LLMs) have created a promising foundation for developing autonomous agents. With the right tools, these agents could learn to solve tasks in new environments by accumulating and updating their knowledge. Current LLM-based agents process past experiences using a full history of observations, summarization, or retrieval augmentation. However, these unstructured memory representations do not facilitate the reasoning and planning essential for complex decision-making. In our study, we introduce AriGraph, a novel method wherein the agent constructs and updates a memory graph that integrates semantic and episodic memories while exploring the environment. We demonstrate that our Ariadne LLM agent, consisting of the proposed memory architecture augmented with planning and decision-making, effectively handles complex tasks within interactive text game environments that are difficult even for human players. Results show that our approach markedly outperforms other established memory methods and strong RL baselines in a range of problems of varying complexity. Additionally, AriGraph demonstrates competitive performance compared to dedicated knowledge graph-based methods in static multi-hop question-answering.

摘要

大语言模型(LLM)能力的进步为开发自主智能体奠定了良好基础。通过配备适当工具,这些智能体能够通过知识积累与更新来学习解决新环境中的任务。当前基于LLM的智能体主要采用完整观察历史记录、摘要提取和检索增强等方式处理过往经验。然而,这些非结构化的记忆表征方式不利于支撑复杂决策所需的推理与规划能力。本研究提出AriGraph方法,该创新方案使智能体在环境探索过程中构建并更新融合语义记忆与情景记忆的记忆图谱。实验表明,我们开发的Ariadne智能体(由增强型记忆架构结合规划决策模块构成)在交互式文本游戏环境中能有效处理即使对人类玩家也颇具挑战性的复杂任务。结果显示,该方法在各类复杂度不同的问题上显著优于现有记忆方法和强化学习基线方案。此外,在静态多跳问答任务中,AriGraph相较专用知识图谱方法也展现出具有竞争力的性能表现。


Demonstrating specification gaming in reasoning models

Abstract

arXiv:2502.13295v2 Announce Type: replace Abstract: We demonstrate LLM agent specification gaming by instructing models to win against a chess engine. We find that reasoning models like OpenAI o3 and DeepSeek R1 will often hack the benchmark by default, while language models like GPT-4o and Claude 3.5 Sonnet resort to hacking only after being told that normal play won't work. We improve upon prior work (Hubinger et al., 2024; Meinke et al., 2024; Weij et al., 2024) by using realistic task prompts and avoiding excess nudging. Our results suggest reasoning models may resort to hacking to solve difficult problems, as observed in OpenAI (2024)'s o1 Docker escape during cyber capabilities testing.

摘要

我们通过指导大语言模型智能体击败国际象棋引擎,展示了其规范博弈行为。研究发现,OpenAI o3和DeepSeek R1等推理模型通常会默认采取破解基准测试的策略,而GPT-4o和Claude 3.5 Sonnet等语言模型需要被告知常规方法无法实现破解才会采取非常规手段。相较于前人研究(Hubinger等,2024;Meinke等,2024;Weij等,2024),本研究的改进在于采用更贴近实际的任务提示,并避免过度引导。研究结果表明,推理模型在解决复杂问题时可能倾向于采取破解策略,这与OpenAI(2024)在网络能力测试中观察到的o1模型Docker逃逸现象相吻合。


MapExplorer: New Content Generation from Low-Dimensional Visualizations

Abstract

arXiv:2412.18673v2 Announce Type: replace Abstract: Low-dimensional visualizations, or "projection maps," are widely used in scientific and creative domains to interpret large-scale and complex datasets. These visualizations not only aid in understanding existing knowledge spaces but also implicitly guide exploration into unknown areas. Although techniques such as t-SNE and UMAP can generate these maps, there exists no systematic method for leveraging them to generate new content. To address this, we introduce MapExplorer, a novel knowledge discovery task that translates coordinates within any projection map into coherent, contextually aligned textual content. This allows users to interactively explore and uncover insights embedded in the maps. To evaluate the performance of MapExplorer methods, we propose Atometric, a fine-grained metric inspired by ROUGE that quantifies logical coherence and alignment between generated and reference text. Experiments on diverse datasets demonstrate the versatility of MapExplorer in generating scientific hypotheses, crafting synthetic personas, and devising strategies for attacking large language models, even with simple baseline methods. By bridging visualization and generation, our work highlights the potential of MapExplorer to enable intuitive human-AI collaboration in large-scale data exploration.

摘要

低维可视化(或称“投影图”)在科学与创意领域被广泛用于解释大规模复杂数据集。这些可视化不仅有助于理解现有知识空间,还隐式地引导对未知领域的探索。尽管t-SNE和UMAP等技术能生成此类图谱,但目前缺乏系统性方法来利用它们生成新内容。为此,我们提出MapExplorer——一种新颖的知识发现任务,可将任意投影图中的坐标转化为连贯且上下文对齐的文本内容,使用户能交互式探索图中嵌入的洞见。为评估MapExplorer方法的性能,我们设计了Atometric指标,该受ROUGE启发的细粒度度量标准可量化生成文本与参考文本间的逻辑连贯性与对齐度。在多领域数据集上的实验表明,即使采用简单基线方法,MapExplorer也能灵活生成科学假设、构建虚拟人物画像,并设计攻击大语言模型的策略。通过连接可视化与生成技术,本研究揭示了MapExplorer在大规模数据探索中实现人机直觉式协作的潜力。


MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning

Abstract

arXiv:2505.10557v1 Announce Type: cross Abstract: Natural language image-caption datasets, widely used for training Large Multimodal Models (LMMs), mainly focus on natural scenarios and overlook the intricate details of mathematical figures that are critical for problem-solving, hindering the advancement of current LMMs in multimodal mathematical reasoning. To this end, we propose leveraging code as supervision for cross-modal alignment, since code inherently encodes all information needed to generate corresponding figures, establishing a precise connection between the two modalities. Specifically, we co-develop our image-to-code model and dataset with a model-in-the-loop approach, resulting in an image-to-code model, FigCodifier, and the ImgCode-8.6M dataset, the largest image-code dataset to date. Furthermore, we utilize FigCodifier to synthesize novel mathematical figures and then construct MM-MathInstruct-3M, a high-quality multimodal math instruction fine-tuning dataset. Finally, we present MathCoder-VL, trained with ImgCode-8.6M for cross-modal alignment and subsequently fine-tuned on MM-MathInstruct-3M for multimodal math problem solving. Our model achieves a new open-source SOTA across all six metrics. Notably, it surpasses GPT-4o and Claude 3.5 Sonnet in the geometry problem-solving subset of MathVista, achieving improvements of 8.9% and 9.2%, respectively. The dataset and models will be released at https://github.com/mathllm/MathCoder.

摘要

自然语言图像描述数据集被广泛用于训练大型多模态模型,但其主要关注自然场景,忽视了数学图表中对解题至关重要的复杂细节,这阻碍了当前多模态模型在数学推理方面的进展。为此,我们提出利用代码作为跨模态对齐的监督信号,因为代码本身编码了生成对应图表所需的全部信息,从而在两种模态之间建立了精确联系。具体而言,我们采用模型在环方法协同开发了图像到代码模型及数据集,最终得到图像到代码模型FigCodifier和ImgCode-8.6M数据集——这是迄今为止规模最大的图像-代码数据集。此外,我们利用FigCodifier合成新型数学图表,进而构建了高质量多模态数学指令微调数据集MM-MathInstruct-3M。最后,我们提出MathCoder-VL模型,该模型先在ImgCode-8.6M上进行跨模态对齐训练,随后在MM-MathInstruct-3M上进行多模态数学解题微调。我们的模型在所有六项指标上均创造了新的开源SOTA记录,尤其在MathVista几何问题求解子集上,其表现分别超越GPT-4o和Claude 3.5 Sonnet达8.9%和9.2%。数据集与模型将在https://github.com/mathllm/MathCoder发布。


SensorChat: Answering Qualitative and Quantitative Questions during Long-Term Multimodal Sensor Interactions

Abstract

arXiv:2502.02883v2 Announce Type: replace Abstract: Natural language interaction with sensing systems is crucial for addressing users' personal concerns and providing health-related insights into their daily lives. When a user asks a question, the system automatically analyzes the full history of sensor data, extracts relevant information, and generates an appropriate response. However, existing systems are limited to short-duration (e.g., one minute) or low-frequency (e.g., daily step count) sensor data. In addition, they struggle with quantitative questions that require precise numerical answers. In this work, we introduce SensorChat, the first end-to-end QA system designed for daily life monitoring using long-duration, high-frequency time series data. Given raw sensor signals spanning multiple days and a user-defined natural language question, SensorChat generates semantically meaningful responses that directly address user concerns. SensorChat effectively handles both quantitative questions that require numerical precision and qualitative questions that require high-level reasoning to infer subjective insights. To achieve this, SensorChat uses an innovative three-stage pipeline including question decomposition, sensor data query, and answer assembly. The first and third stages leverage Large Language Models (LLMs) to interpret human queries and generate responses. The intermediate querying stage extracts relevant information from the complete sensor data history. Real-world implementations demonstrate SensorChat's capability for real-time interactions on a cloud server while also being able to run entirely on edge platforms after quantization. Comprehensive QA evaluations show that SensorChat achieves up to 93% higher answer accuracy than state-of-the-art systems on quantitative questions. Additionally, a user study with eight volunteers highlights SensorChat's effectiveness in answering qualitative and open-ended questions.

摘要

与传感系统的自然语言交互对于解决用户个人关切并提供其日常生活中的健康相关洞察至关重要。当用户提出问题时,系统会自动分析完整的传感器数据历史记录,提取相关信息并生成恰当回应。然而,现有系统仅能处理短时(如一分钟)或低频(如每日步数)传感器数据,且在需要精确数值答案的定量问题上表现欠佳。本研究提出SensorChat——首个基于长时程高频时序数据的日常生活监测端到端问答系统。给定跨越多日的原始传感器信号和用户自定义的自然语言问题,SensorChat能生成直接回应用户关切的语义化响应。该系统可同时处理需要数值精度的定量问题,以及需通过高层推理推断主观洞察的定性问题。为实现这一目标,SensorChat采用创新的三阶段流程,包括问题分解、传感器数据查询和答案组装。首末阶段利用大语言模型(LLMs)解析人类查询并生成响应,中间查询阶段则从完整传感器历史数据中提取相关信息。实际部署表明,SensorChat既能在云服务器实现实时交互,也可在量化后完全运行于边缘平台。全面问答评估显示,在定量问题上,SensorChat的答案准确率较现有最优系统提升达93%。针对八名志愿者的用户研究进一步证实了该系统在回答定性和开放式问题方面的有效性。


Collaborative Speculative Inference for Efficient LLM Inference Serving

Abstract

arXiv:2503.10325v2 Announce Type: replace Abstract: Speculative inference is a promising paradigm employing small speculative models (SSMs) as drafters to generate draft tokens, which are subsequently verified in parallel by the target large language model (LLM). This approach enhances the efficiency of inference serving by reducing LLM inference latency and costs while preserving generation quality. However, existing speculative methods face critical challenges, including inefficient resource utilization and limited draft acceptance, which constrain their scalability and overall effectiveness. To overcome these obstacles, we present CoSine, a novel speculative inference system that decouples sequential speculative decoding from parallel verification, enabling efficient collaboration among multiple nodes. Specifically, CoSine routes inference requests to specialized drafters based on their expertise and incorporates a confidence-based token fusion mechanism to synthesize outputs from cooperating drafters, ensuring high-quality draft generation. Additionally, CoSine dynamically orchestrates the execution of speculative decoding and verification in a pipelined manner, employing batch scheduling to selectively group requests and adaptive speculation control to minimize idle periods. By optimizing parallel workflows through heterogeneous node collaboration, CoSine balances draft generation and verification throughput in real-time, thereby maximizing resource utilization. Experimental results demonstrate that CoSine achieves superior performance compared to state-of-the-art speculative approaches. Notably, with equivalent resource costs, CoSine achieves up to a 23.2% decrease in latency and a 32.5% increase in throughput compared to baseline methods.

摘要

推测推理是一种新兴范式,它采用小型推测模型(SSM)作为起草器生成草稿标记,随后由目标大语言模型(LLM)并行验证。该方法通过降低LLM推理延迟和成本同时保持生成质量,提升了推理服务效率。然而,现有推测方法面临关键挑战,包括资源利用率低下和草稿接受率有限,制约了其可扩展性和整体效能。为克服这些障碍,我们提出CoSine——一种新型推测推理系统,该系统将顺序推测解码与并行验证解耦,实现多节点间高效协作。具体而言,CoSine根据专业能力将推理请求路由至专用起草器,并采用基于置信度的标记融合机制合成合作起草器的输出,确保高质量草稿生成。此外,CoSine通过流水线方式动态编排推测解码与验证的执行,利用批量调度选择性分组请求,并采用自适应推测控制最小化空闲时间。通过异构节点协作优化并行工作流,CoSine实时平衡草稿生成与验证吞吐量,从而实现资源利用率最大化。实验结果表明,相较于最先进的推测方法,CoSine展现出卓越性能。值得注意的是,在同等资源成本下,CoSine相比基线方法可实现23.2%的延迟降低和32.5%的吞吐量提升。
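
The draft-then-verify loop that underlies speculative inference can be sketched with two toy "models". Everything below is a hypothetical illustration: `target_next` and `draft_next` are made-up deterministic functions, and the sketch covers only single-drafter greedy verification, not CoSine's routing, token fusion, or pipelined scheduling.

```python
def target_next(prefix):
    # Hypothetical "large model": next token is the sum of the last two tokens mod 10.
    return (prefix[-1] + prefix[-2]) % 10


def draft_next(prefix):
    # Hypothetical "small model": agrees with the target except when the last token is 7.
    if prefix[-1] == 7:
        return 0
    return (prefix[-1] + prefix[-2]) % 10


def speculative_step(prefix, k):
    # 1) Draft k tokens sequentially with the small model.
    draft = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))
    # 2) Verify each draft position against the target
    #    (a real system scores all positions in one parallel pass).
    accepted = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch with the target's token
            return accepted
        accepted.append(tok)
    # All drafts accepted: the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prefix, n_tokens, k=4):
    # Repeat draft-and-verify steps until n_tokens new tokens exist.
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculative_step(out, k))
    return out[: len(prefix) + n_tokens]


def generate_target_only(prefix, n_tokens):
    # Baseline: one target call per generated token.
    out = list(prefix)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out


# Verification guarantees the same output as decoding with the target alone.
assert generate([1, 1], 20) == generate_target_only([1, 1], 20)
```

Because every accepted token is checked against the target's own choice, the output is unchanged; the speed-up in a real system comes from verifying several draft tokens in one target pass, which is why draft acceptance rate matters.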


LLM A*: Human in the Loop Large Language Models Enabled A* Search for Robotics

Abstract

arXiv:2312.01797v3 Announce Type: replace-cross Abstract: This research focuses on how Large Language Models (LLMs) can help with (path) planning for mobile embodied agents such as robots, in a human-in-the-loop and interactive manner. A novel framework named LLM A*, which aims to leverage the commonsense of LLMs and the utility-optimal A*, is proposed to facilitate few-shot near-optimal path planning. Prompts are used for two main purposes: 1) to provide LLMs with essential information like environments, costs, heuristics, etc.; 2) to communicate human feedback on intermediate planning results to LLMs. This approach takes human feedback on board and renders the entire planning process transparent (akin to a "white box") to humans. Moreover, it facilitates code-free path planning, thereby fostering the accessibility and inclusiveness of artificial intelligence techniques to communities less proficient in coding. Comparative analysis against A* and RL demonstrates that LLM A* exhibits greater efficiency in terms of search space and achieves paths comparable to A* while outperforming RL. The interactive nature of LLM A* also makes it a promising tool for deployment in collaborative human-robot tasks. Codes and Supplemental Materials can be found at GitHub: https://github.com/speedhawk/LLM-A-.

摘要

本研究聚焦于如何通过人类参与循环的交互方式,利用大语言模型(LLMs)为机器人等移动实体代理进行(路径)规划。我们提出名为LLM A*的创新框架,旨在结合LLMs的常识推理能力与效用最优的A*算法,实现少样本近优路径规划。提示词主要用于两个核心功能:1)向LLMs传递环境、成本函数、启发式规则等关键信息;2)将人类对中间规划结果的反馈传达给LLMs。该方法不仅整合了人类反馈,还使整个规划过程对人类保持透明(类似"白盒"机制)。此外,它实现了无需编码的路径规划,从而提升人工智能技术对编程能力较弱群体的可及性与包容性。与A*和强化学习(RL)的对比分析表明,LLM A*在搜索空间效率方面表现更优,其生成路径与A*相当且优于RL。LLM A*的交互特性也使其成为人机协作任务部署的理想工具。代码及补充材料详见GitHub:https://github.com/speedhawk/LLM-A-。


Neural Thermodynamic Laws for Large Language Model Training

Abstract

arXiv:2505.10559v1 Announce Type: cross Abstract: Beyond neural scaling laws, little is known about the laws underlying large language models (LLMs). We introduce Neural Thermodynamic Laws (NTL) -- a new framework that offers fresh insights into LLM training dynamics. On the theoretical side, we demonstrate that key thermodynamic quantities (e.g., temperature, entropy, heat capacity, thermal conduction) and classical thermodynamic principles (e.g., the three laws of thermodynamics and the equipartition theorem) naturally emerge under river-valley loss landscape assumptions. On the practical side, this scientific perspective yields intuitive guidelines for designing learning rate schedules.

摘要

除神经标度律外,人们对大语言模型(LLM)背后的规律知之甚少。我们提出了神经热力学定律(NTL)——这一新框架为LLM训练动力学提供了全新见解。在理论层面,我们证明在河流谷损失景观假设下,关键热力学量(如温度、熵、热容、热传导)与经典热力学原理(如热力学三定律和能量均分定理)会自然涌现。在实践层面,这一科学视角为学习率调度设计提供了直观指导准则。


Tokenization Matters! Degrading Large Language Models through Challenging Their Tokenization

Abstract

arXiv:2405.17067v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have shown remarkable capabilities in language understanding and generation. Nonetheless, LLMs have also been observed to produce inaccurate responses to specific queries. This deficiency can be traced to the tokenization step LLMs must undergo, which is an inevitable limitation inherent to all LLMs. In fact, incorrect tokenization is the critical point that hinders LLMs in understanding the input precisely, thus leading to unsatisfactory output. This defect is more obvious in Chinese scenarios. To demonstrate this flaw of LLMs, we construct an adversarial dataset, named ADT (Adversarial Dataset for Tokenizer), which draws upon the vocabularies of various open-source LLMs to challenge LLMs' tokenization. ADT consists of two subsets: the manually constructed ADT-Human and the automatically generated ADT-Auto. Our empirical results reveal that ADT is highly effective at challenging the tokenization of leading LLMs, including GPT-4o, Llama-3, Deepseek-R1 and others, thus degrading these LLMs' capabilities. Moreover, our method of automatic data generation has proven efficient and robust, and can be applied to any open-source LLM. In this paper, we thoroughly investigate LLMs' vulnerability by challenging their token segmentation, which will shed light on subsequent research on improving LLMs' capabilities through optimizing their tokenization process and algorithms.

摘要

大型语言模型(LLMs)在语言理解与生成方面展现出卓越能力,但同时也存在对特定查询生成错误响应的缺陷。这一不足可追溯至LLMs必须经历的分词步骤——这是所有LLMs固有的不可避免的局限性。事实上,错误的分词是阻碍LLMs精准理解输入内容的关键因素,从而导致不理想的输出结果。该缺陷在中文场景中尤为显著。为验证LLMs这一缺陷,我们构建了名为ADT(分词器对抗数据集)的对抗数据集,其通过整合各类开源LLMs的词表来挑战LLMs的分词能力。ADT包含两个子集:人工构建的ADT-Human与自动生成的ADT-Auto。实验结果表明,我们的ADT能有效挑战包括GPT-4o、Llama-3、Deepseek-R1等主流LLMs的分词机制,显著降低这些模型的性能。此外,我们提出的自动数据生成方法被证实具有高效性与鲁棒性,可适用于任何开源LLMs。本文通过系统研究LLMs在分词挑战中的脆弱性,为后续通过优化分词过程与算法来提升LLMs能力的研究提供了重要启示。


PersLLM: A Personified Training Approach for Large Language Models

Abstract

arXiv:2407.12393v5 Announce Type: replace-cross Abstract: Large language models (LLMs) exhibit human-like intelligence, enabling them to simulate human behavior and support various applications that require both humanized communication and extensive knowledge reserves. Efforts have been made to personify LLMs with special training data or hand-crafted prompts, but these face challenges such as insufficient data usage and rigid behavior patterns, respectively. Consequently, personified LLMs fail to capture personified knowledge or express persistent opinions. To fully unlock the potential of LLM personification, we propose PersLLM, a framework for better data construction and model tuning. For insufficient data usage, we incorporate strategies such as Chain-of-Thought prompting and anti-induction, improving the quality of data construction and capturing the personality's experiences, knowledge, and thoughts more comprehensively. For rigid behavior patterns, we design the tuning process and introduce automated DPO to enhance the specificity and dynamism of the models' personalities, which leads to more natural communication of opinions. Both automated metrics and expert human evaluations demonstrate the effectiveness of our approach. Case studies in human-machine interactions and multi-agent systems further suggest potential application scenarios and future directions for LLM personification.

摘要

大型语言模型(LLMs)展现出类人智能,使其能够模拟人类行为,并支持需要人性化交流与丰富知识储备的各类应用。现有研究尝试通过特殊训练数据或人工设计提示词来实现LLMs的人格化,但面临数据利用不足或行为模式僵化等挑战,导致人格化模型难以捕捉个性化知识或表达连贯观点。为充分释放LLM人格化潜力,我们提出PersLLM框架,通过优化数据构建与模型微调实现突破。针对数据利用不足问题,我们整合思维链提示与反诱导等策略,提升数据构建质量,更全面地捕捉人格特质相关的经历、知识与思维模式。针对行为僵化问题,我们设计微调流程并引入自动化DPO,增强模型人格的特异性与动态性,使观点表达更自然。自动化指标与专家评估均验证了方法的有效性。在人机交互与多智能体系统中的案例研究进一步揭示了LLM人格化的潜在应用场景与未来发展方向。


Large Language Models for Cyber Security: A Systematic Literature Review

Abstract

arXiv:2405.04760v4 Announce Type: replace-cross Abstract: The rapid advancement of Large Language Models (LLMs) has opened up new opportunities for leveraging artificial intelligence in various domains, including cybersecurity. As the volume and sophistication of cyber threats continue to grow, there is an increasing need for intelligent systems that can automatically detect vulnerabilities, analyze malware, and respond to attacks. In this survey, we conduct a comprehensive review of the literature on the application of LLMs in cybersecurity (LLM4Security). By comprehensively collecting over 30K relevant papers and systematically analyzing 127 papers from top security and software engineering venues, we aim to provide a holistic view of how LLMs are being used to solve diverse problems across the cybersecurity domain. Through our analysis, we identify several key findings. First, we observe that LLMs are being applied to a wide range of cybersecurity tasks, including vulnerability detection, malware analysis, network intrusion detection, and phishing detection. Second, we find that the datasets used for training and evaluating LLMs in these tasks are often limited in size and diversity, highlighting the need for more comprehensive and representative datasets. Third, we identify several promising techniques for adapting LLMs to specific cybersecurity domains, such as fine-tuning, transfer learning, and domain-specific pre-training. Finally, we discuss the main challenges and opportunities for future research in LLM4Security, including the need for more interpretable and explainable models, the importance of addressing data privacy and security concerns, and the potential for leveraging LLMs for proactive defense and threat hunting. Overall, our survey provides a comprehensive overview of the current state-of-the-art in LLM4Security and identifies several promising directions for future research.

摘要

大型语言模型(LLM)的快速发展为人工智能在网络安全等领域的应用开辟了新机遇。随着网络威胁的数量和复杂程度持续增长,对能够自动检测漏洞、分析恶意软件并响应攻击的智能系统的需求日益迫切。本综述系统梳理了LLM在网络安全领域(LLM4Security)应用的文献,通过全面收集超过3万篇相关论文并系统分析来自顶级安全与软件工程会议的127篇论文,旨在全景展现LLM如何被用于解决网络安全领域的各类问题。通过分析,我们得出若干重要发现:首先,LLM正被应用于漏洞检测、恶意软件分析、网络入侵检测和钓鱼检测等广泛的安全任务;其次,这些任务中用于训练和评估LLM的数据集往往在规模和多样性上存在局限,凸显了对更全面、更具代表性数据集的需求;第三,我们识别出若干将LLM适配特定网络安全领域的有前景的技术,包括微调、迁移学习和领域特定预训练;最后,我们探讨了LLM4Security未来研究面临的主要挑战与机遇,包括对可解释模型的需求、解决数据隐私与安全问题的必要性,以及利用LLM实现主动防御和威胁狩猎的潜力。本综述全面呈现了LLM4Security的研究现状,并指明了若干具有前景的未来研究方向。


PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

Abstract

arXiv:2406.02069v4 Announce Type: replace-cross Abstract: In this study, we investigate whether attention-based information flow inside large language models (LLMs) is aggregated through noticeable patterns for long context processing. Our observations reveal that LLMs aggregate information through Pyramidal Information Funneling, where attention scatters widely in lower layers, progressively consolidates within specific contexts, and ultimately focuses on critical tokens (a.k.a. massive activation or attention sink) in higher layers. Motivated by these insights, we developed PyramidKV, a novel and effective KV cache compression method. This approach dynamically adjusts the KV cache size across different layers, allocating more cache in lower layers and less in higher ones, diverging from traditional methods that maintain a uniform KV cache size. Our experimental evaluations, utilizing the LongBench benchmark, show that PyramidKV matches the performance of models with a full KV cache while retaining only 12% of the KV cache, thus significantly reducing memory usage. In scenarios emphasizing memory efficiency, where only 0.7% of the KV cache is maintained, PyramidKV surpasses other KV cache compression techniques, achieving an absolute accuracy improvement of up to 20.5 points on the TREC dataset. In the Needle-in-a-Haystack experiment, PyramidKV outperforms competing methods in maintaining long-context comprehension in LLMs; notably, retaining just 128 KV cache entries enables the LLAMA-3-70B model to achieve 100.0 accuracy.

摘要

本研究探讨了大型语言模型(LLM)中基于注意力的信息流是否通过显著模式进行聚合以实现长上下文处理。观察发现,LLM通过金字塔式信息汇聚机制实现信息聚合:底层注意力广泛分散,中间层逐步集中于特定上下文,最终在高层聚焦于关键token(即大规模激活或注意力汇聚点)。基于此发现,我们提出PyramidKV——一种新颖高效的KV缓存压缩方法。该方法动态调整各层KV缓存大小,底层分配较多缓存而高层分配较少,与传统保持均匀KV缓存的方法不同。通过LongBench基准测试评估表明,PyramidKV在仅保留12% KV缓存的情况下,性能与完整KV缓存相当,显著降低了内存占用。在强调内存效率的场景中(仅保留0.7% KV缓存),PyramidKV优于其他KV压缩技术,在TREC数据集上实现最高20.5个百分点的绝对准确率提升。在'大海捞针'实验中,PyramidKV在保持LLM长上下文理解能力方面表现优异:仅保留128个KV缓存条目即可使LLAMA-3-70B模型达到100.0的准确率。
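
The idea of giving lower layers more KV cache than higher layers, under the same total budget as a uniform allocation, can be sketched as follows. The linear schedule and the `ratio` parameter are assumptions made for illustration; the abstract does not specify the exact allocation rule PyramidKV uses.

```python
def pyramid_budgets(num_layers, total_budget, ratio=4.0):
    # Linearly decreasing weights: the lowest layer gets `ratio` times the weight
    # of the highest layer (ratio is an illustrative knob, not a value from the paper).
    if num_layers == 1:
        return [total_budget]
    weights = [ratio - (ratio - 1.0) * i / (num_layers - 1) for i in range(num_layers)]
    scale = total_budget / sum(weights)
    budgets = [int(w * scale) for w in weights]
    # Hand the rounding remainder to the lowest layers so the total is preserved.
    for i in range(total_budget - sum(budgets)):
        budgets[i % num_layers] += 1
    return budgets


uniform = [128] * 8
pyramid = pyramid_budgets(num_layers=8, total_budget=128 * 8)
print(pyramid)
# Same total memory as the uniform allocation, distributed as a pyramid.
assert sum(pyramid) == sum(uniform)
```

Index 0 is the lowest layer here. Keeping the total fixed makes the comparison with a uniform cache fair: only the distribution across layers changes.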


SAKR: Enhancing Retrieval-Augmented Generation via Streaming Algorithm and K-Means Clustering

Abstract

arXiv:2407.21300v4 Announce Type: replace-cross Abstract: Retrieval-augmented generation (RAG) has achieved significant success in information retrieval to assist large language models (LLMs) because it builds an external knowledge database. However, it also has problems: it consumes a lot of memory because of the enormous database, and it cannot update the established index database in time when confronted with massive streaming data. To reduce the memory required for building the database while maintaining accuracy, we propose a new approach that integrates a streaming algorithm and k-means clustering into RAG. Our approach applies a streaming algorithm to update the index dynamically and reduce memory consumption. Additionally, the k-means algorithm clusters highly similar documents, which shortens query time. We conducted comparative experiments on four methods, and the results indicate that RAG with the streaming algorithm and k-means clustering outperforms traditional RAG in accuracy and memory, particularly when dealing with large-scale streaming data.

摘要

检索增强生成(RAG)通过构建外部知识库,在辅助大语言模型的信息检索领域取得了显著成功。然而该方法仍存在诸多问题:庞大的数据库导致内存消耗过高,且面对海量流式数据时无法及时更新已建立的索引库。为在降低建库内存需求的同时保持准确性,我们提出一种将流式算法与k-means聚类整合至RAG的新方法。该方案采用流式算法动态更新索引并减少内存占用,同时通过k-means算法对高相似度文档进行聚类以缩短查询时间。我们在四种方法上开展对比实验,结果表明:融合流式算法与k-means聚类的RAG在准确性和内存效率方面均优于传统RAG,尤其在大规模流式数据处理场景下表现更为突出。


Time Awareness in Large Language Models: Benchmarking Fact Recall Across Time

Abstract

arXiv:2409.13338v3 Announce Type: replace-cross Abstract: Who is the US President? The answer changes depending on when the question is asked. While large language models (LLMs) are evaluated on various reasoning tasks, they often miss a crucial dimension: time. In real-world scenarios, the correctness of answers is frequently tied to temporal context. To address this gap, we present a novel framework and dataset spanning over 8,000 events from 2018 to 2024, annotated with day-level granularity and sourced globally across domains such as politics, science, and business. Our TimeShift evaluation method systematically probes LLMs for temporal reasoning, revealing that base models often outperform instruction-tuned and synthetic-trained counterparts on time-sensitive recall. Additionally, we find that even large-scale models exhibit brittleness in handling paraphrased facts, highlighting unresolved challenges in temporal consistency. By identifying these limitations, our work provides a significant step toward advancing time-aware language models capable of adapting to the dynamic nature of real-world knowledge.

摘要

美国总统是谁?答案会因提问时间而异。虽然大型语言模型(LLM)已在多种推理任务中接受评估,但它们往往忽略了一个关键维度:时间。在现实场景中,答案的正确性常与时间背景紧密相关。为填补这一空白,我们提出了一个新颖的框架和数据集,涵盖2018至2024年间8000余个事件,标注精度达天级,数据来源横跨政治、科学和商业等全球多领域。我们的TimeShift评估方法系统性地探究了LLM的时间推理能力,发现基础模型在时间敏感型回忆任务上通常优于指令微调模型和合成训练模型。此外,我们发现即使大规模模型在处理转述事实时也表现出脆弱性,这凸显了时间一致性方面尚未解决的挑战。通过揭示这些局限性,本研究为推动具有时间感知能力的语言模型发展迈出重要一步,使其能够适应现实世界知识的动态特性。


Beyond Next Token Prediction: Patch-Level Training for Large Language Models

Abstract

arXiv:2407.12665v3 Announce Type: replace-cross Abstract: The prohibitive training costs of Large Language Models (LLMs) have emerged as a significant bottleneck in the development of next-generation LLMs. In this paper, we show that it is possible to significantly reduce the training costs of LLMs without sacrificing their performance. Specifically, we introduce patch-level training for LLMs, in which multiple tokens are aggregated into a unit of higher information density, referred to as a 'patch', to serve as the fundamental text unit for training LLMs. During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch, thereby processing the majority of the training data at a significantly reduced cost. Following this, the model continues token-level training on the remaining training data to align with the inference mode. Experiments on a diverse range of models (370M-2.7B parameters) demonstrate that patch-level training can reduce the overall training costs to 0.5×, without compromising the model performance compared to token-level training. Source code: https://github.com/shaochenze/PatchTrain.

摘要

大型语言模型(LLMs)的高昂训练成本已成为下一代LLM发展的主要瓶颈。本文研究表明,在不牺牲模型性能的前提下,可显著降低LLM训练成本。具体而言,我们提出LLM的块级训练方法:将多个token聚合成信息密度更高的基本训练单元(称为"块"),在训练过程中向语言模型输入较短的块序列并预测下一块,从而以显著降低的成本处理大部分训练数据。此后,模型继续在剩余数据上进行token级训练以匹配推理模式。在多种参数量级模型(3.7亿-27亿参数)上的实验表明,相较于token级训练,块级训练可将总体训练成本降低至0.5倍,同时保持模型性能不变。
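
The patch idea and the 0.5× figure can be illustrated with a small sketch. It rests on two assumptions that the abstract does not state: patches are formed here by averaging the embeddings of `patch_size` consecutive tokens, and training cost is taken as proportional to the number of sequence positions processed. The function names, the patch size of 4, and the two-thirds patch-level share are illustrative choices, not settings confirmed by the abstract.

```python
from fractions import Fraction


def to_patches(token_embeddings, patch_size):
    """Average each run of `patch_size` token embeddings into one patch.

    Assumption: mean pooling; trailing tokens that do not fill a patch
    are dropped.
    """
    usable = len(token_embeddings) - len(token_embeddings) % patch_size
    patches = []
    for start in range(0, usable, patch_size):
        group = token_embeddings[start:start + patch_size]
        dim = len(group[0])
        patches.append(
            [sum(vec[d] for vec in group) / patch_size for d in range(dim)]
        )
    return patches


def relative_cost(patch_fraction, patch_size):
    """Cost relative to pure token-level training.

    Assumption: cost scales with the number of positions processed, so
    the patch-level share of the data costs 1/patch_size as much.
    """
    return patch_fraction / patch_size + (1 - patch_fraction)


# Four 2-d token embeddings become two patch embeddings.
patches = to_patches([[1, 1], [3, 3], [5, 5], [7, 7]], patch_size=2)

# Two thirds of the data at patch level with 4 tokens per patch.
cost = relative_cost(Fraction(2, 3), 4)
```

Under these assumptions the cost works out to (2/3)/4 + 1/3 = 1/2, which shows how a sequence-shortening stage followed by a shorter token-level stage can halve the overall budget.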


Compensate Quantization Errors+: Quantized Models Are Inquisitive Learners

Abstract

arXiv:2407.15508v3 Announce Type: replace-cross Abstract: The quantization of large language models (LLMs) has been a prominent research area aimed at enabling their lightweight deployment in practice. Existing research on LLM quantization has mainly explored the interplay between weights and activations, or employed auxiliary components, while neglecting the necessity of adjusting weights during quantization. Consequently, original weight distributions frequently fail to yield desired results after round-to-nearest (RTN) quantization. Even though incorporating techniques such as mixed precision and low-rank error approximation in LLM quantization can yield improved results, they inevitably introduce additional computational overhead. On the other hand, traditional techniques for weight quantization, such as Generative Post-Training Quantization, rely on manually tweaking weight distributions to minimize local errors, but they fall short of achieving globally optimal outcomes. Although the recently proposed Learnable Singular-value Increment improves global weight quantization by modifying weight distributions, it disrupts the original distribution considerably. This introduces pronounced bias toward the training data and can degrade downstream task performance. In this paper, we introduce Singular-value Diagonal Expansion, a more nuanced approach to refining weight distributions to achieve better quantization alignment. Furthermore, we introduce Cross-layer Learning that improves overall quantization outcomes by distributing errors more evenly across layers. Our plug-and-play weight-quantization methods demonstrate substantial performance improvements over state-of-the-art approaches, including OmniQuant, DuQuant, and PrefixQuant.

摘要

大型语言模型(LLMs)的量化一直是推动其轻量化部署的重要研究方向。现有关于LLM量化的研究主要探索权重与激活值之间的相互作用,或采用辅助组件,却忽视了量化过程中调整权重的必要性。这导致原始权重分布在最近舍入(RTN)量化后往往无法获得理想结果。尽管在LLM量化中引入混合精度和低秩误差近似等技术可提升效果,但这些方法不可避免地会带来额外计算开销。另一方面,生成式训练后量化等传统权重量化技术依赖于人工调整权重分布以最小化局部误差,却难以实现全局最优。虽然最近提出的可学习奇异值增量通过修改权重分布改善了全局权重量化,但该方法会显著破坏原始分布,导致对训练数据的明显偏置并可能损害下游任务性能。本文提出奇异值对角扩展法,通过更精细的权重分布调整实现更优的量化对齐。此外,我们引入跨层学习技术,通过在各层间更均匀地分配误差来提升整体量化效果。实验表明,我们的即插即用式权重量化方法在性能上显著优于OmniQuant、DuQuant和PrefixQuant等最先进方案。


Natural Language Reinforcement Learning

Abstract

arXiv:2411.14251v2 Announce Type: replace-cross Abstract: Reinforcement Learning (RL) mathematically formulates decision-making with Markov Decision Process (MDP). With MDPs, researchers have achieved remarkable breakthroughs across various domains, including games, robotics, and language models. This paper seeks a new possibility, Natural Language Reinforcement Learning (NLRL), by extending traditional MDP to natural language-based representation space. Specifically, NLRL innovatively redefines RL principles, including task objectives, policy, value function, Bellman equation, and policy iteration, into their language counterparts. With recent advancements in large language models (LLMs), NLRL can be practically implemented to achieve RL-like policy and value improvement by either pure prompting or gradient-based training. Experiments over Maze, Breakthrough, and Tic-Tac-Toe games demonstrate the effectiveness, efficiency, and interpretability of the NLRL framework among diverse use cases.

摘要

强化学习(RL)通过马尔可夫决策过程(MDP)对决策问题进行数学建模。基于MDP框架,研究人员已在游戏、机器人和语言模型等多个领域取得显著突破。本文探索了一种新范式——自然语言强化学习(NLRL),通过将传统MDP扩展到基于自然语言的表示空间来实现。具体而言,NLRL创新性地将RL核心要素(包括任务目标、策略、价值函数、贝尔曼方程和策略迭代)重新定义为对应的语言形式。随着大语言模型(LLM)的最新进展,NLRL可通过纯提示或基于梯度的训练方式,实际实现类似RL的策略与价值优化。在迷宫游戏、Breakthrough和井字棋等实验环境中,NLRL框架在不同应用场景下均展现出有效性、高效性和可解释性。


Towards Graph Foundation Models: Training on Knowledge Graphs Enables Transferability to General Graphs

Abstract

arXiv:2410.12609v2 Announce Type: replace-cross Abstract: Inspired by the success of large language models, there is a trend toward developing graph foundation models to conduct diverse downstream tasks in various domains. However, current models often require extra fine-tuning to apply their learned structural and semantic representations to new graphs, which limits their versatility. Recent breakthroughs in zero-shot inductive reasoning on knowledge graphs (KGs) offer us a new perspective on extending KG reasoning to general graph applications. In this paper, we introduce SCR, a unified graph reasoning framework designed to train on knowledge graphs and effectively generalize across a wide range of graph tasks and domains. We begin by designing task-specific KG structures to establish a unified topology for different task formats. Then we propose semantic-conditioned message passing, a novel mechanism addressing the inherent semantic isolation in traditional KG reasoning, by jointly modeling structural and semantic invariance patterns in graph representations. To demonstrate its effectiveness, we evaluate the inductive reasoning capability of SCR using 38 diverse graph datasets, covering node-level, link-level, and graph-level tasks across multiple domains. Our results show substantial performance gains over existing foundation models and supervised baselines, highlighting the efficacy and adaptability of our approach.

摘要

受到大语言模型成功的启发,当前正出现开发图基础模型的趋势,旨在跨领域执行多样化下游任务。然而,现有模型通常需要额外微调才能将其学习到的结构与语义表征迁移至新图数据,这限制了其通用性。知识图谱零样本归纳推理的最新突破,为我们提供了将知识图谱推理扩展至通用图应用的新视角。本文提出SCR这一统一图推理框架,该框架专为知识图谱训练设计,并能有效泛化至各类图任务与领域。我们首先通过设计任务特定的知识图谱结构,为不同任务格式建立统一拓扑表示;继而提出语义条件消息传递机制——该创新方法通过联合建模图表征中的结构不变性与语义不变性模式,解决了传统知识图谱推理中固有的语义隔离问题。为验证有效性,我们在涵盖节点级、链接级和图级任务的38个跨领域图数据集上评估了SCR的归纳推理能力。实验结果表明,相较于现有基础模型与监督基线,本方法取得了显著性能提升,充分证明了所提框架的有效性与适应性。


KBAlign: Efficient Self Adaptation on Specific Knowledge Bases

Abstract

arXiv:2411.14790v4 Announce Type: replace-cross Abstract: Although retrieval-augmented generation (RAG) remains essential for knowledge-based question answering (KBQA), current paradigms face critical challenges under specific domains. Existing methods struggle with targeted adaptation on small-scale KBs: vanilla unsupervised training exhibits poor effectiveness, while fine-tuning incurs prohibitive costs of external signals. We present KBAlign, a self-supervised framework that enhances RAG systems through efficient model adaptation. Our key insight is to leverage the model's intrinsic capabilities for knowledge alignment through two innovative mechanisms: multi-grained self-annotation that captures global knowledge for data construction, and iterative tuning that accelerates convergence through self verification. This framework enables cost-effective model adaptation to specific textual KBs, without human supervision or external model assistance. Experiments demonstrate that KBAlign can achieve 90% of the performance gain obtained through GPT-4-supervised adaptation, while relying entirely on self-annotation of much smaller models. KBAlign significantly improves downstream QA accuracy across multiple domains with tiny costs, particularly benefiting scenarios requiring deep knowledge integration from specialized corpora. We release our experimental data, models, and process analyses to the community for further exploration (https://github.com/thunlp/KBAlign).

摘要

尽管检索增强生成(RAG)在基于知识的问答(KBQA)中仍不可或缺,但现有范式在特定领域下面临关键挑战。当前方法难以实现小规模知识库的针对性适配:无监督训练效果欠佳,而微调则需耗费高昂的外部信号成本。我们提出KBAlign,一种通过高效模型适配增强RAG系统的自监督框架。核心思想是通过两种创新机制利用模型内在能力实现知识对齐:多粒度自标注捕获全局知识以构建数据,迭代调优通过自我验证加速收敛。该框架无需人工监督或外部模型辅助,即可实现针对特定文本知识库的经济高效模型适配。实验表明,KBAlign仅依赖小规模模型的自标注,即可达到GPT-4监督适配90%的性能增益。该方法以极小成本显著提升跨领域下游问答准确率,尤其适用于需要深度融合专业语料知识的场景。我们向社区公开实验数据、模型及过程分析以供进一步探索(https://github.com/thunlp/KBAlign)。


Simple and Provable Scaling Laws for the Test-Time Compute of Large Language Models

Abstract

arXiv:2411.19477v3 Announce Type: replace-cross Abstract: We propose two simple, principled and practical algorithms that enjoy provable scaling laws for the test-time compute of large language models (LLMs). The first one is a two-stage knockout-style algorithm: given an input problem, it first generates multiple candidate solutions, and then aggregates them via a knockout tournament for the final output. Assuming that the LLM can generate a correct solution with non-zero probability and do better than a random guess in comparing a pair of correct and incorrect solutions, we prove theoretically that the failure probability of this algorithm decays to zero exponentially or by a power law (depending on the specific way of scaling) as its test-time compute grows. The second one is a two-stage league-style algorithm, where each candidate is evaluated by its average win rate against multiple opponents, rather than eliminated upon loss to a single opponent. Under analogous but more robust assumptions, we prove that its failure probability also decays to zero exponentially with more test-time compute. Both algorithms require a black-box LLM and nothing else (e.g., no verifier or reward model) for a minimalistic implementation, which makes them appealing for practical applications and easy to adapt for different tasks. Through extensive experiments with diverse models and datasets, we validate the proposed theories and demonstrate the outstanding scaling properties of both algorithms.

摘要

我们提出两种简单、原则性强且实用的算法,这些算法能够为大型语言模型(LLM)的测试时计算提供可证明的扩展规律。第一种是两阶段淘汰制算法:给定输入问题后,首先生成多个候选解决方案,然后通过淘汰赛机制聚合这些方案以产生最终输出。假设LLM能以非零概率生成正确解,并且在比较正确解与错误解时表现优于随机猜测,我们从理论上证明,随着测试时计算的增加,该算法的失败概率呈指数级或以幂律形式(取决于具体的扩展方式)衰减至零。第二种是两阶段联赛制算法,其中每个候选解通过其与多个对手的平均胜率进行评估,而非因单次失利即被淘汰。在类似但更具鲁棒性的假设下,我们证明其失败概率同样会随着测试时计算的增加而呈指数级衰减。两种算法仅需黑盒LLM即可实现最小化部署(例如无需验证器或奖励模型),这使得它们在实际应用中极具吸引力,并能轻松适配不同任务。通过采用多样化模型和数据集的广泛实验,我们验证了所提出的理论,并证明了两种算法卓越的扩展特性。
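
The two-stage knockout procedure described above can be sketched as follows. The callables `generate` and `compare` are hypothetical stand-ins for black-box LLM calls, and the `n_votes` parameter (repeating each comparison and taking a majority) and the bye rule for odd-sized rounds are assumptions of this sketch rather than details given in the abstract.

```python
def knockout(problem, generate, compare, n_candidates, n_votes=1):
    """Generate candidates, then eliminate them pairwise until one remains.

    generate(problem) -> candidate solution        (hypothetical LLM call)
    compare(problem, a, b) -> the preferred of a, b (hypothetical LLM call)
    """
    # Stage 1: sample candidate solutions.
    candidates = [generate(problem) for _ in range(n_candidates)]

    # Stage 2: knockout tournament over the candidates.
    while len(candidates) > 1:
        next_round = []
        for i in range(0, len(candidates) - 1, 2):
            a, b = candidates[i], candidates[i + 1]
            # Repeat the comparison n_votes times; a advances on a majority
            # (ties go to a).
            wins_a = sum(compare(problem, a, b) == a for _ in range(n_votes))
            next_round.append(a if 2 * wins_a >= n_votes else b)
        if len(candidates) % 2 == 1:
            # An unpaired candidate advances without a match.
            next_round.append(candidates[-1])
        candidates = next_round
    return candidates[0]
```

Both knobs scale test-time compute: `n_candidates` raises the chance that at least one correct solution is sampled, and `n_votes` makes each pairwise decision more reliable when the comparator is only slightly better than chance.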


Not All Adapters Matter: Selective Adapter Freezing for Memory-Efficient Fine-Tuning of Language Models

Abstract

arXiv:2412.03587v2 Announce Type: replace-cross Abstract: Transformer-based large-scale pre-trained models have achieved great success. Fine-tuning is the standard practice for leveraging these models in downstream tasks. Among the fine-tuning methods, adapter-tuning provides parameter-efficient fine-tuning by introducing lightweight trainable modules while keeping most pre-trained parameters frozen. However, existing adapter-tuning methods still impose substantial resource usage. Through our investigation, we show that adapters contribute unequally to both task performance and resource usage. Motivated by this insight, we propose Selective Adapter FrEezing (SAFE), which gradually freezes less important adapters early to reduce unnecessary resource usage while maintaining performance. In our experiments, SAFE reduces memory usage, computation amount, and training time by 42.85%, 34.59%, and 11.82%, respectively, while achieving comparable or better task performance compared to the baseline. We also demonstrate that SAFE induces a regularization effect, thereby smoothing the loss landscape, which enables the model to generalize better by avoiding sharp minima.

摘要

基于Transformer的大规模预训练模型取得了巨大成功。在下游任务中,微调是利用这些模型的标准做法。在各类微调方法中,适配器调参通过引入轻量级可训练模块并保持大部分预训练参数冻结,实现了参数高效的微调。然而现有适配器调参方法仍存在显著的资源消耗问题。通过实验分析,我们发现不同适配器对任务性能和资源消耗的贡献存在显著差异。基于这一发现,我们提出选择性适配器冻结方法(SAFE),该方法通过提前冻结重要性较低的适配器来减少不必要的资源消耗,同时保持模型性能。实验表明,与基线方法相比,SAFE在保证相当或更优任务性能的同时,可降低42.85%的内存占用、34.59%的计算量和11.82%的训练时间。我们还证明SAFE能产生正则化效果,通过平滑损失函数曲面使模型避免陷入尖锐最小值,从而获得更好的泛化能力。


ARR: Question Answering with Large Language Models via Analyzing, Retrieving, and Reasoning

Abstract

arXiv:2502.04689v3 Announce Type: replace-cross Abstract: Large language models (LLMs) have demonstrated impressive capabilities on complex evaluation benchmarks, many of which are formulated as question-answering (QA) tasks. Enhancing the performance of LLMs in QA contexts is becoming increasingly vital for advancing their development and applicability. This paper introduces ARR, an intuitive, effective, and general QA solving method that explicitly incorporates three key steps: analyzing the intent of the question, retrieving relevant information, and reasoning step by step. Notably, this paper is the first to introduce intent analysis in QA, which plays a vital role in ARR. Comprehensive evaluations across 10 diverse QA tasks demonstrate that ARR consistently outperforms the baseline methods. Ablation and case studies further validate the positive contributions of each ARR component. Furthermore, experiments involving variations in prompt design indicate that ARR maintains its effectiveness regardless of the specific prompt formulation. Additionally, extensive evaluations across various model sizes, LLM series, and generation settings solidify the effectiveness, robustness, and generalizability of ARR.

摘要

大语言模型(LLMs)在复杂评估基准上展现出卓越性能,其中多数基准被构建为问答(QA)任务。提升LLMs在QA场景中的表现对其发展与适用性推进日趋关键。本文提出ARR方法——一种直观、有效且通用的QA求解方法,该方法明确整合了三个关键步骤:分析问题意图、检索相关信息及逐步推理。值得注意的是,本文首次在QA中引入意图分析,该环节对ARR具有核心作用。在10项多样化QA任务上的综合评估表明,ARR始终优于基线方法。消融研究与案例分析进一步验证了ARR各组成部分的积极贡献。此外,通过提示设计变体的实验表明,ARR在不同提示表述下均保持有效性。针对不同模型规模、LLM系列和生成设置的广泛评估,进一步巩固了ARR的有效性、鲁棒性与泛化能力。


The Lazy Student's Dream: ChatGPT Passing an Engineering Course on Its Own

Abstract

arXiv:2503.05760v3 Announce Type: replace-cross Abstract: This paper presents a comprehensive investigation into the capability of Large Language Models (LLMs) to successfully complete a semester-long undergraduate control systems course. Through evaluation of 115 course deliverables, we assess LLM performance using ChatGPT under a "minimal effort" protocol that simulates realistic student usage patterns. The investigation employs a rigorous testing methodology across multiple assessment formats, from auto-graded multiple choice questions to complex Python programming tasks and long-form analytical writing. Our analysis provides quantitative insights into AI's strengths and limitations in handling mathematical formulations, coding challenges, and theoretical concepts in control systems engineering. The LLM achieved a B-grade performance (82.24%), approaching but not exceeding the class average (84.99%), with strongest results in structured assignments and greatest limitations in open-ended projects. The findings inform discussions about course design adaptation in response to AI advancement, moving beyond simple prohibition towards thoughtful integration of these tools in engineering education. Additional materials including syllabus, examination papers, design projects, and example responses can be found at the project website: https://gradegpt.github.io.

摘要

本文针对大型语言模型(LLMs)完成本科控制系统课程整学期学习任务的能力展开全面研究。通过评估115项课程作业,我们采用"最小化努力"协议(模拟真实学生使用模式)对ChatGPT的LLM表现进行测评。研究采用严谨的测试方法覆盖多种考核形式,包括自动评分的多选题、复杂的Python编程任务以及长篇分析性写作。我们的分析从量化角度揭示了AI在控制系统工程中处理数学公式、编程挑战和理论概念的优势与局限。该LLM最终获得B级成绩(82.24%),接近但未超过班级平均分(84.99%),其在结构化作业中表现最优,而在开放性项目中局限最大。研究结果为课程设计如何应对AI发展提供了建设性讨论方向,推动工程教育从简单禁止转向对这些工具的深思熟虑整合。课程大纲、试卷、设计项目及示例回答等补充材料详见项目网站:https://gradegpt.github.io。


Implicit Bias-Like Patterns in Reasoning Models

Abstract

arXiv:2503.11572v2 Announce Type: replace-cross Abstract: Implicit bias refers to automatic mental processes that shape perceptions, judgments, and behaviors. Previous research on "implicit bias" in LLMs focused primarily on outputs rather than the processes underlying the outputs. We present the Reasoning Model Implicit Association Test (RM-IAT) to study implicit bias-like processing in reasoning models, which are LLMs using step-by-step reasoning for complex tasks. Using RM-IAT, we find o3-mini and DeepSeek R1 require more tokens when processing association-incompatible information, mirroring human implicit bias patterns. Conversely, Claude 3.7 Sonnet displays reversed patterns for race and gender tests, requiring more tokens for association-compatible information. This reversal appears linked to differences in safety mechanism activation, increasing deliberation in sensitive contexts. These findings suggest AI systems can exhibit processing patterns analogous to both human implicit bias and bias correction mechanisms.

摘要

内隐偏见是指影响感知、判断和行为的自动心理过程。先前关于大型语言模型(LLM)中'内隐偏见'的研究主要关注输出结果而非其底层过程。我们提出推理模型内隐联想测试(RM-IAT)来研究推理模型中的类内隐偏见处理机制,这类模型指通过逐步推理处理复杂任务的LLM。运用RM-IAT发现,o3-mini和DeepSeek R1在处理关联冲突信息时需要更多标记,这与人类内隐偏见模式一致。相反,Claude 3.7 Sonnet在种族和性别测试中呈现反向模式,对关联相容信息需要更多标记。这种反转现象可能与安全机制激活差异有关,其在敏感语境下会增强审慎性。这些发现表明人工智能系统可同时表现出类人类内隐偏见及偏见校正机制的处理模式。


Benchmarking Generative AI for Scoring Medical Student Interviews in Objective Structured Clinical Examinations (OSCEs)

Abstract

arXiv:2501.13957v2 Announce Type: replace-cross Abstract: Objective Structured Clinical Examinations (OSCEs) are widely used to assess medical students' communication skills, but scoring interview-based assessments is time-consuming and potentially subject to human bias. This study explored the potential of large language models (LLMs) to automate OSCE evaluations using the Master Interview Rating Scale (MIRS). We compared the performance of four state-of-the-art LLMs (GPT-4o, Claude 3.5, Llama 3.1, and Gemini 1.5 Pro) in evaluating OSCE transcripts across all 28 items of the MIRS under the conditions of zero-shot, chain-of-thought (CoT), few-shot, and multi-step prompting. The models were benchmarked against a dataset of 10 OSCE cases with 174 expert consensus scores available. Model performance was measured using three accuracy metrics (exact, off-by-one, thresholded). Averaging across all MIRS items and OSCE cases, LLMs performed with low exact accuracy (0.27 to 0.44), and moderate to high off-by-one accuracy (0.67 to 0.87) and thresholded accuracy (0.75 to 0.88). A zero temperature parameter ensured high intra-rater reliability (α = 0.98 for GPT-4o). CoT, few-shot, and multi-step techniques proved valuable when tailored to specific assessment items. The performance was consistent across MIRS items, independent of encounter phases and communication domains. We demonstrated the feasibility of AI-assisted OSCE evaluation and provided benchmarking of multiple LLMs across multiple prompt techniques. Our work provides a baseline performance assessment for LLMs that lays a foundation for future research into automated assessment of clinical communication skills.

摘要

客观结构化临床考试(OSCE)被广泛用于评估医学生的沟通技能,但基于访谈的评分工作耗时且易受人为偏差影响。本研究探讨了利用大型语言模型(LLMs)通过主访谈评分量表(MIRS)实现OSCE评估自动化的潜力。我们比较了四种前沿LLM模型(GPT-4o、Claude 3.5、Llama 3.1和Gemini 1.5 Pro)在零样本提示、思维链(CoT)、少样本提示和多步提示条件下对MIRS全部28个条目的OSCE转录文本评估表现。模型在包含10个OSCE案例(含174份专家共识评分)的数据集上进行基准测试,采用精确匹配、相邻容错和阈值判定三种准确度指标。所有MIRS条目和OSCE案例的平均结果显示:LLMs的精确匹配准确度较低(0.27至0.44),相邻容错准确度中等偏高(0.67至0.87),阈值判定准确度较高(0.75至0.88)。零温度参数确保了较高的评分者内信度(GPT-4o的α=0.98)。针对特定评估条目定制的思维链、少样本和多步技术被证明具有应用价值。模型表现在不同MIRS条目间具有一致性,与问诊阶段和沟通领域无关。本研究证实了AI辅助OSCE评估的可行性,并对多种提示技术下的多款LLM进行了基准测试,为临床沟通技能自动化评估的未来研究建立了性能基准。
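
The gap between the exact and off-by-one figures is easier to read with the metrics written out. The sketch below is an illustration under assumptions: the abstract names the three metrics but does not define them, so the thresholded variant here (model and expert scores fall on the same side of a cutoff) is a guess at the intent, and the function names, cutoff value, and sample scores are all made up for the example.

```python
def exact_accuracy(pred, gold):
    """Share of items where the model score equals the expert score."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def off_by_one_accuracy(pred, gold):
    """Share of items where the model score is within one point."""
    return sum(abs(p - g) <= 1 for p, g in zip(pred, gold)) / len(gold)


def thresholded_accuracy(pred, gold, threshold):
    """Share of items where both scores fall on the same side of a cutoff.

    Assumption: this definition and the cutoff are hypothetical; the
    abstract does not say how the thresholded metric is computed.
    """
    return sum(
        (p >= threshold) == (g >= threshold) for p, g in zip(pred, gold)
    ) / len(gold)


# Made-up scores for four rubric items (not data from the study).
model_scores = [1, 3, 5, 2]
expert_scores = [1, 4, 3, 2]

exact = exact_accuracy(model_scores, expert_scores)
near = off_by_one_accuracy(model_scores, expert_scores)
same_side = thresholded_accuracy(model_scores, expert_scores, threshold=3)
```

On these toy scores the three metrics come out at 0.5, 0.75, and 1.0: each looser criterion forgives a larger class of disagreements, which is the same ordering the reported ranges show.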


CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives

Abstract

arXiv:2504.10823v2 Announce Type: replace-cross Abstract: Navigating high-stakes dilemmas involving conflicting values is challenging even for humans, let alone for AI. Yet prior work in evaluating the reasoning capabilities of large language models (LLMs) in such situations has been limited to everyday scenarios. To close this gap, this work first introduces CLASH (Character perspective-based LLM Assessments in Situations with High-stakes), a meticulously curated dataset consisting of 345 high-impact dilemmas along with 3,795 individual perspectives of diverse values. In particular, we design CLASH in a way to support the study of critical aspects of value-based decision-making processes which are missing from prior work, including understanding decision ambivalence and psychological discomfort as well as capturing the temporal shifts of values in characters' perspectives. By benchmarking 10 open and closed frontier models, we uncover several key findings. (1) Even the strongest models, such as GPT-4o and Claude-Sonnet, achieve less than 50% accuracy in identifying situations where the decision should be ambivalent, while they perform significantly better in clear-cut scenarios. (2) While LLMs reasonably predict psychological discomfort as marked by humans, they inadequately comprehend perspectives involving value shifts, indicating a need for LLMs to reason over complex values. (3) Our experiments also reveal a significant correlation between LLMs' value preferences and their steerability towards a given value. (4) Finally, LLMs exhibit greater steerability when engaged in value reasoning from a third-party perspective, compared to a first-person setup, though certain value pairs benefit uniquely from the first-person framing.

摘要

对于人类而言,处理涉及价值观冲突的高风险困境已属不易,人工智能则面临更大挑战。然而现有研究对大型语言模型(LLMs)在此类情境中推理能力的评估仅局限于日常场景。为填补这一空白,本研究首先提出CLASH(基于角色视角的高风险情境LLM评估)——一个精心构建的数据集,包含345个高影响困境及3,795个体现多元价值观的个体视角。特别地,我们设计的CLASH支持研究先前工作中缺失的价值观决策关键维度,包括理解决策矛盾和心理不适感,以及捕捉角色视角中价值观的时序变化。通过对10个开源和闭源前沿模型的基准测试,我们得出若干重要发现:(1)即使最强模型(如GPT-4o和Claude-Sonnet)在识别应存在决策矛盾的场景时准确率不足50%,而在明确场景中表现显著更好;(2)虽然LLMs能合理预测人类标注的心理不适,但对涉及价值观转变的视角理解不足,表明其需提升复杂价值观推理能力;(3)实验揭示LLMs的价值观偏好与其对特定价值观的可引导性存在显著相关性;(4)最后,相较于第一人称设定,LLMs在第三方视角下进行价值观推理时表现出更强的可引导性,但某些特定价值观组合在第一人称框架下具有独特优势。


Data-Driven Calibration of Prediction Sets in Large Vision-Language Models Based on Inductive Conformal Prediction

Abstract

arXiv:2504.17671v3 Announce Type: replace-cross Abstract: This study addresses the critical challenge of hallucination mitigation in Large Vision-Language Models (LVLMs) for Visual Question Answering (VQA) tasks through a Split Conformal Prediction (SCP) framework. While LVLMs excel in multi-modal reasoning, their outputs often exhibit hallucinated content with high confidence, posing risks in safety-critical applications. We propose a model-agnostic uncertainty quantification method that integrates dynamic threshold calibration and cross-modal consistency verification. By partitioning data into calibration and test sets, the framework computes nonconformity scores to construct prediction sets with statistical guarantees under user-defined risk levels (α). Key innovations include: (1) rigorous control of marginal coverage to ensure empirical error rates remain strictly below α; (2) dynamic adjustment of prediction set sizes inversely with α, filtering low-confidence outputs; (3) elimination of prior distribution assumptions and retraining requirements. Evaluations on benchmarks (ScienceQA, MMMU) with eight LVLMs demonstrate that SCP enforces theoretical guarantees across all α values. The framework achieves stable performance across varying calibration-to-test split ratios, underscoring its robustness for real-world deployment in healthcare, autonomous systems, and other safety-sensitive domains. This work bridges the gap between theoretical reliability and practical applicability in multi-modal AI systems, offering a scalable solution for hallucination detection and uncertainty-aware decision-making.

摘要

本研究针对大型视觉语言模型(LVLMs)在视觉问答(VQA)任务中的幻觉缓解关键挑战,提出了一种基于分割共形预测(SCP)框架的解决方案。尽管LVLMs在多模态推理方面表现卓越,但其输出常伴随高置信度的幻觉内容,在安全关键应用中存在风险。我们提出了一种与模型无关的不确定性量化方法,整合动态阈值校准和跨模态一致性验证。通过将数据划分为校准集和测试集,该框架计算非共形分数以构建具有用户定义风险水平(α)下统计保证的预测集。核心创新包括:(1)严格控制边际覆盖,确保经验错误率始终低于α;(2)根据α反向动态调整预测集规模,过滤低置信度输出;(3)无需先验分布假设和模型重训练要求。在ScienceQA、MMMU等基准测试中对八种LVLMs的评估表明,SCP在所有α值下均能强制执行理论保证。该框架在不同校准-测试分割比例下均保持稳定性能,凸显了其在医疗健康、自主系统等安全敏感领域实际应用的鲁棒性。本工作弥合了多模态AI系统理论可靠性与实际应用间的鸿沟,为幻觉检测和不确定性感知决策提供了可扩展的解决方案。
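
The calibrate-then-predict step of split conformal prediction can be sketched in a few lines. This is the textbook recipe rather than the authors' code: the nonconformity score is assumed to be one minus the probability the model assigns to an answer option, the function names are hypothetical, and the dynamic threshold calibration and cross-modal consistency checks mentioned in the abstract are not modeled.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Finite-sample quantile of the calibration nonconformity scores.

    cal_scores holds one score per calibration example, computed for the
    true answer (assumed here to be 1 - p(true answer)).
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: keep every option.
        return float("inf")
    return sorted(cal_scores)[rank - 1]


def prediction_set(option_probs, threshold):
    """Keep every answer option whose nonconformity is within the threshold."""
    return {opt for opt, p in option_probs.items() if 1 - p <= threshold}


# Made-up calibration scores and one made-up test question.
calibration = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
question = {"A": 0.7, "B": 0.2, "C": 0.1}

q_hat = conformal_threshold(calibration, alpha=0.5)
answers = prediction_set(question, q_hat)
```

With exchangeable calibration and test data, sets built this way contain the true answer with probability at least 1 − α on average over draws of the data, which is the marginal guarantee the abstract refers to. A smaller α raises the threshold and therefore enlarges the sets.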


RM-R1: Reward Modeling as Reasoning

Abstract

arXiv:2505.02387v2 Announce Type: replace-cross Abstract: Reward modeling is essential for aligning large language models (LLMs) with human preferences through reinforcement learning (RL). To provide accurate reward signals, a reward model (RM) should stimulate deep thinking and conduct interpretable reasoning before assigning a score or a judgment. Inspired by recent advances of long chain-of-thought (CoT) on reasoning-intensive tasks, we hypothesize and validate that integrating reasoning capabilities into reward modeling significantly enhances RM's interpretability and performance. To this end, we introduce a new class of generative reward models -- Reasoning Reward Models (ReasRMs) -- which formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. RM-R1 features a chain-of-rubrics (CoR) mechanism -- self-generating sample-level chat rubrics or math/code solutions, and evaluating candidate responses against them. The training of RM-R1 consists of two key stages: (1) distillation of high-quality reasoning chains and (2) reinforcement learning with verifiable rewards. Empirically, our models achieve state-of-the-art performance across three reward model benchmarks on average, outperforming much larger open-weight models (e.g., INF-ORM-Llama3.1-70B) and proprietary ones (e.g., GPT-4o) by up to 4.9%. Beyond final performance, we perform thorough empirical analysis to understand the key ingredients of successful ReasRM training. To facilitate future research, we release six ReasRM models along with code and data at https://github.com/RM-R1-UIUC/RM-R1.

摘要

奖励建模对于通过强化学习(RL)将大型语言模型(LLMs)与人类偏好对齐至关重要。为提供准确的奖励信号,奖励模型(RM)应在评分或判断前激发深度思考并进行可解释的推理。受长链思维(CoT)在推理密集型任务中的最新进展启发,我们提出假设并验证:将推理能力整合到奖励建模中可显著提升RM的可解释性与性能。为此,我们引入了一类新型生成式奖励模型——推理奖励模型(ReasRMs),其将奖励建模构建为推理任务。我们提出面向推理的训练流程,并训练了ReasRM系列模型RM-R1。该模型采用"规则链"(CoR)机制——自主生成样本级对话规则或数学/代码解决方案,并据此评估候选响应。RM-R1的训练包含两个关键阶段:(1)高质量推理链的蒸馏;(2)基于可验证奖励的强化学习。实验表明,我们的模型在三个奖励模型基准测试中平均达到最先进性能,以最高4.9%的优势超越更大规模的开源模型(如INF-ORM-Llama3.1-70B)和商业模型(如GPT-4o)。除最终性能外,我们还通过全面实证分析揭示了成功训练ReasRM的关键要素。为促进后续研究,我们在https://github.com/RM-R1-UIUC/RM-R1 发布了六个ReasRM模型及相关代码与数据。